A Gentle Introduction to Kafka, a Distributed Streaming Platform

Zico Deng
9 min read · Jul 3, 2020

Recently, I was exposed to Kafka at my workplace. I had heard of it a while ago and knew it is a very popular distributed streaming platform, but I never had a chance to learn it. With my limited experience with a traditional messaging system, RabbitMQ, I somehow tended to think of Kafka as just yet another variation created by LinkedIn.

If you are like me, with a little background knowledge of message queues and a desire to learn Kafka, this article is for you. I will first discuss why this technology was created, then walk through some basic concepts of Kafka, and finally talk about how it differs from traditional messaging systems.

The Problem

Before diving into Kafka, let's talk about why messaging systems exist in the first place. What problem are they trying to solve?

👎 Scalability

In modern software architecture, engineers are obsessed with the idea of microservices: breaking down a giant monolithic service into many self-contained, independent, and loosely coupled tiny services. The communication between these microservices is traditionally done via REST API calls. This is fine until one day, a heavily used microservice gets overwhelmed.

Service A Sweating 💦💦💦

As we can see from the diagram above, services B, C, and D all rely on service A. This means service A has to handle an overwhelming number of requests during high-traffic hours. Putting too much pressure on it can lead to a server crash.

As microservices grow in number, the communication between them increases too. It is common for many different microservices to send a large number of API requests to the same destination microservice, which can cause that reusable microservice to burn out. That said, direct communication between microservices is not a scalable approach.

โŒ Message Durability

When API requests and responses between services travel over the network, there is no guarantee that they are always fulfilled. Because servers are typically stateless, once messages are sent, they are gone. In case of a network failure or server downtime, messages can get lost. Even if the server eventually restarts, it has no idea how many messages were lost during the downtime, so it cannot reprocess them.

Service A on Fire 🔥🔥🔥

As the diagram above shows, service A is down because of too many requests. When it restarts and brings itself back online, service A cannot catch up on the missed messages because they are simply gone.

Synchronous Communication

When a source service sends an API request to a destination service, the source service eventually gets a response back. Even though the network operation may be asynchronous, meaning the source service can perform other work while waiting for a response, the communication is still considered synchronous because the source service needs to wait for the message to be processed by the destination service.
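
To make this concrete, here is a minimal sketch of a source service calling a destination service over REST with Java's built-in HttpClient. The URL and endpoint are made up for illustration; the point is that the caller cannot move on with the result until the destination has finished its work and replied.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SynchronousCall {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical endpoint exposed by destination "service A".
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://service-a.internal/orders/42"))
                .GET()
                .build();

        // The call blocks until service A has processed the request and replied.
        // If service A takes minutes to compute, we wait minutes.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("Status: " + response.statusCode());
        System.out.println("Body: " + response.body());
    }
}
```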

Coupled Architecture

When services directly interact with each other over the network, they are tightly coupled. When a source service sends an API request, it waits for a response, and the length of that wait depends heavily on the business logic running in the destination service. Imagine the destination service needs to perform a complex computation that takes a few minutes; the source service has to wait that long to get a response.

The Solution: Message Broker

A message broker is just an intermediary program (a piece of middleware) that deals with incoming messages from producers and outgoing messages to consumers.

It can be responsible for the following tasks (not necessarily all of them):

  • Store messages
  • Route messages
  • Monitor messages
  • Manage messages

Before getting into details, let's make sure we understand some basic terminology:

  • Producer/Publisher: a service or application that produces/publishes messages to a broker (it can be thought of as a source service in the traditional communication model).
  • Consumer/Subscriber: a service or application that consumes/subscribes to messages from a broker (it can be thought of as a destination service in the traditional communication model).

Now, let's see how a message broker addresses the problems with direct communication between services.

๐Ÿ‘ Scalability

By introducing a middleman, the message broker, we can easily achieve scalability. Brokers are designed to scale both vertically and horizontally. No more pressure on our highly reusable service.

Service A Relaxing 😊😊😊

The diagram above shows services B, C, and D talking to the broker instead of communicating directly with service A. The producers are still producing the same number of messages, but this time the consumer can handle them at its own comfortable pace.

✅ Message Durability

With a message broker, messages are not lost, because the broker persists them, either in memory or on disk, until consumers acknowledge that they have been processed. Some message brokers even allow messages to be persisted for a configurable amount of time. That means messages can be kept around forever if we want.

Service A Down ☠️☠️☠️

In the diagram above, destination service A is down, but source services B, C, and D are still operating and producing more messages. Fortunately, the broker in the middle makes sure unread messages are persisted.

Asynchronous Communication

Because a source service now only communicates with a broker, it does not have to wait for a destination service to finish processing messages and send back a response. This enables truly asynchronous communication.
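
As a minimal sketch of what this looks like with Kafka's Java producer client (the topic name, key, value, and broker address are placeholders): send() only waits for the broker's acknowledgment, never for any consumer to process the message.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AsyncProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}");

            // send() is asynchronous: the callback fires once the broker acknowledges
            // the write. No consumer has to be online, or even exist, at this point.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Stored at %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });

            producer.flush();
        }
    }
}
```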

Decoupled Architecture

When source services and destination services no longer depend on each other, we end up having a loosely coupled architecture.

Kafka

The Almighty Kafka, Single Source of Truth for All Your Data

Finally, we are here. It's time to talk about Kafka. I apologize that it took so long to explain the problem and the solution, but it is worth knowing the background of why a particular technology emerges.

Created and open-sourced by LinkedIn, Kafka is a scalable, fault-tolerant, and distributed streaming platform.

Pros

Kafka excels in the following areas:

  • Excellent horizontal scalability: Kafka can easily scale to 100 message brokers and handle millions of messages per second.
  • High performance: latency can be under 10 ms, which can be considered real time.
  • Highly trusted: the technology is used by many top tech companies such as LinkedIn, Airbnb, and Netflix.

Applications

Although Kafka can be used in a lot of different ways, there are two major categories of application:

  • Transfer data: move data between microservices.
  • Transform data: process an input stream and produce an output stream (a minimal sketch follows the list).
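
As a taste of the second category, here is a minimal Kafka Streams sketch under assumed topic names (raw-events and clean-events) and a local broker; it reads an input stream, transforms each record, and produces an output stream.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class TransformApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-transformer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read an input stream, transform each value, and write an output stream.
        KStream<String, String> input = builder.stream("raw-events");
        input.mapValues(value -> value.toUpperCase())
             .to("clean-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```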

We have just learned what Kafka really is, what benefits it provides, and how it can be used in the real world. Now, let's dive into its core concepts.

Topic

Anatomy of a Topic

A topic is an ordered collection of events (a stream of records) stored in a durable way (data is written to disk) for a configurable amount of time (which could be days, months, or years).

There is no limit on how big a topic can be. It could be small or enormous, depending on the amount of data.

Consumers can subscribe to different topics for different categories of data.
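
As a concrete illustration of these properties, here is a minimal sketch that creates a topic with a per-topic retention period using Kafka's Java AdminClient. The topic name, partition count, replication factor, and retention period are all made-up values for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // A hypothetical topic named "page-views" with 3 partitions, a replication
            // factor of 1, and records retained for 30 days. retention.ms is the
            // per-topic override; setting it to -1 would keep records forever.
            NewTopic topic = new NewTopic("page-views", 3, (short) 1)
                    .configs(Map.of("retention.ms",
                            String.valueOf(30L * 24 * 60 * 60 * 1000)));

            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```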

Partition

In a nutshell, a partition is just a subset of data on a topic.

Why is a topic divided into multiple partitions?

  • Because a single server has limited disk space, partitioning allows a topic to grow beyond the capacity of any one machine, theoretically storing an unbounded amount of data. Imagine a topic with terabytes of data that cannot fit on one server; this is solved by breaking the topic into multiple subsets and storing them on different servers.
  • It allows consumers to read from a topic in parallel (see the consumer sketch at the end of this section).

To ensure high availability, each partition is also replicated for fault tolerance in case of system failure.

While more partitions give us higher throughput, more parallel read/write units, and a larger storage limit, they also cost us more open file handles (each partition maps to a directory of log segment files, each needing file handles), more memory in the client (the producer buffers messages per partition), and higher end-to-end message delivery latency (replicating data across all in-sync replicas and committing messages takes longer).
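
To see that parallelism in action, here is a minimal consumer sketch (the topic name, group id, and broker address are placeholders). Running several copies of this program with the same group.id makes Kafka split the topic's partitions among them, so each instance reads its own subset in parallel.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ParallelConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "page-view-counters");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("page-views"));

            // Each instance in the "page-view-counters" group is assigned a share of
            // the topic's partitions and reads only those, in parallel with its peers.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```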

Traditional Messaging Systems vs Kafka

You have probably noticed that I describe Kafka as a distributed streaming platform, but how is it different from traditional messaging systems such as RabbitMQ?

I will compare the two, attribute by attribute.

Payload (Message vs Stream)

What is moving around in the system?

In a traditional messaging system, a single message, or a piece of data, is considered the payload. We can think of it as a drop of water.

In Kafka, a payload can be a stream or a continuous flow of data over the network. A good analogy would be a flow of water.

Core (Queue vs Commit Log)

What is the core idea behind each system?

A traditional messaging system leverages a queue to deal with incoming and outgoing messages. Messages are kept in the queue until they are processed by consumers, and each message is processed only once, by a single consumer.

At its core, Kafka is a big, distributed, replicated commit log. It is worth noting that the log here has nothing to do with application logging; it is closer to the transaction log maintained by a DBMS. A commit log is an append-only, totally ordered (typically by time) data structure. Once messages are written to the log, they cannot be modified or deleted by consumers. This log is important because it records the entire history of what happened to the system and when.
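
To make the idea concrete, here is a toy append-only log in plain Java. This is purely an illustration of the data structure, not how Kafka is actually implemented: every record gets an ever-increasing position (an offset), readers can read from any position, and nothing is ever modified in place.

```java
import java.util.ArrayList;
import java.util.List;

// A toy commit log: append-only, totally ordered, never modified in place.
public class ToyCommitLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns its offset (its fixed position in the log).
    public synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reads the record at a given offset; old records stay readable forever.
    public synchronized String read(long offset) {
        return records.get((int) offset);
    }

    public static void main(String[] args) {
        ToyCommitLog log = new ToyCommitLog();
        long first = log.append("user signed up");
        long second = log.append("user placed order");
        System.out.println(first + ": " + log.read(first));   // 0: user signed up
        System.out.println(second + ": " + log.read(second)); // 1: user placed order
    }
}
```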

Read Receipt (Acknowledgement Message vs Cursor)

How does a message broker know if a message is successfully delivered and read by a consumer?

In a traditional messaging system, once a consumer successfully receives a message, it sends an acknowledgment back to the broker.

In Kafka, a consumer does not need to send anything back to the broker. Instead, the consumer is responsible for keeping track of a cursor: the position (offset) of its last processed message. This also allows it to go back in time and query old messages, giving it the ability to read the complete history.
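
Here is a minimal sketch of that cursor using Kafka's Java client (the topic name, group id, and broker address are placeholders): because the position is just an offset the consumer controls, it can rewind to the beginning of a partition and replay the whole history.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "audit-replayer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singleton(partition));

            // The cursor is ours to move: jump back to the start and re-read history.
            consumer.seekToBeginning(Collections.singleton(partition));

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r ->
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```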

Delivery Mechanism (One vs Many)

How is a message delivered?

A broker in a traditional messaging system delivers each message to only one consumer.

A Kafka broker can deliver the same message to every consumer (or, more precisely, every consumer group) that subscribes to the topic.

Durability (In-Memory vs Disk)

Messages in both systems are durable, so that consumers can read unprocessed messages after they restart from an unexpected server shutdown. However, because messages are stored differently, data retention differs as well.

In a traditional messaging system, messages are stored in memory and are deleted from the queue upon successful delivery.

In Kafka, messages are persisted on disk for a configurable amount of time. They can be kept forever if desired.

Conclusion

As microservice-oriented and event-driven software architectures become increasingly popular, a technology that ensures reliable and scalable communication between services is needed. To solve this problem, message brokers emerged, allowing us to further decouple our architecture, remove dependencies between services, and improve scalability. Kafka, one of the most popular distributed streaming platforms, can be used as a single source of truth for all of our application events.

I hope you enjoyed reading this article! If you liked it and learned something new, give it some claps 👏👏👏 Really appreciated 🙏🙏🙏
