Kafka is a distributed streaming platform that lets applications publish and subscribe to streams of messages carrying data in real time.
This article will cover the architecture of Kafka, which is essential for understanding how to properly set up your streaming analysis environment. Later on, I will provide an example of real-time data analysis by creating an instant messaging environment with Kafka. By the end of this article, you will understand the principal features that make Kafka so useful, namely:
- commit log
- horizontal scalability
- fault tolerance
For now, let’s simply keep them in mind: their meaning will become much clearer once we introduce a few more notions.
So let’s begin with Kafka’s architecture. The main idea is that the sender sends its messages to the Kafka server, and the receiver asks the Kafka server for the messages it’s interested in. Then Zookeeper, a coordination service released by Apache, keeps track of the status of the Kafka server and manages its configuration and metadata. Of course, the architecture is far more complex, but this is the underlying skeleton.
Now let’s start diving deeper into Kafka’s structure, by initializing some terminology and “roles”:
- Producer: the application that sends the message
- Message: the payload of the exchange, which Kafka reads as an array of bytes
- Consumer: the application that receives the message
- Topic: a unique name identifying a Kafka stream of messages
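To make these roles concrete, here is a deliberately tiny in-memory model in Python (this is not the Kafka API; the topic name and function names are invented for illustration):

```python
# A toy model of the four roles: the "broker" is just a dict
# mapping topic names to lists of raw-bytes messages.
broker = {}  # topic name -> list of messages (arrays of bytes)

def produce(topic, message: bytes):
    """Producer: append a message (an array of bytes) to a topic."""
    broker.setdefault(topic, []).append(message)

def consume(topic):
    """Consumer: read every message currently stored for a topic."""
    return broker.get(topic, [])

produce("greetings", b"hello")   # Kafka sees the payload as raw bytes
produce("greetings", b"world")
print(consume("greetings"))      # [b'hello', b'world']
```

The real system adds persistence, distribution, and replication on top of this basic publish/subscribe shape.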
So far, we can already have a more concrete idea of how it works:
As said before, Kafka continuously receives requests to send and receive messages. These requests are processed, and the data transferred, by the so-called Kafka brokers, which are coordinated and managed by Zookeeper. A broker is nothing more than a Kafka server; “broker” is simply a meaningful name for it. The notion of a broker leads directly to that of a cluster: brokers run on a Kafka cluster, which is a group of machines, each executing one instance of a Kafka broker. Let’s append these two further notions to our list:
- Broker: a Kafka server running on a cluster. It manages the messages it receives and sends them to the consumers
- Cluster: a group of machines, each executing one instance of Kafka broker
Notice that the concept of a cluster of machines explains what “distributed” means: a distributed system is one split across multiple running machines (each executing a broker) working together as a cluster.
Sometimes, the amount of data sent by the producer is so large that a single computer in the cluster might not be able to store and process it. That’s why Kafka offers the possibility of splitting a topic’s data into smaller pieces, called partitions. Each partition has a unique ID. The number of partitions is set by the user when creating a topic, and Kafka then splits the data accordingly.
Note that partitions are based on a persistent, ordered data structure that only supports appends: you cannot delete or modify messages that have already been sent to the broker; you can only append further messages. This data structure is called a commit log.
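A commit log can be mimicked with a small Python class that only supports appending (a sketch of the idea, not of Kafka’s internals):

```python
class CommitLog:
    """Append-only sequence: records can be added, never changed or removed."""

    def __init__(self):
        self._records = []

    def append(self, record: bytes) -> int:
        """Add a record to the end and return its position (its offset)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int) -> bytes:
        """Reading never mutates the log; old records stay where they are."""
        return self._records[offset]

log = CommitLog()
first = log.append(b"created order 42")
second = log.append(b"paid order 42")
print(first, second)   # 0 1 -- offsets grow monotonically
```

Since there is no `delete` or `update` method, consumers can re-read history at any time: this is exactly the property the article's commit-log feature refers to.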
Thus, each partition will contain a sequence of messages that need to be identified, and this is what an offset does. It is a sequence of numbers identifying each message in a partition, in the order the messages arrive (so the first message has offset = 0). It is important to notice that offsets are local, not global: every partition has a message identified with an offset equal to 0, but of course these are not the same data! That’s why each message has a specific path of identification: topic name, partition number (ID), offset. To sum up:
- Partition: one of the “fragments” into which a topic’s data can be split by Kafka brokers
- Offset: a sequence of ID numbers identifying, locally, the order of messages within each partition
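The identification path can be illustrated with a short sketch. Here I choose the partition by hashing a message key, which is one common strategy; the topic name and helper names are made up for illustration:

```python
NUM_PARTITIONS = 3
partitions = [[] for _ in range(NUM_PARTITIONS)]   # one commit log per partition

def send(topic: str, key: str, value: bytes):
    """Pick a partition from the message key, append, return the full address."""
    p = hash(key) % NUM_PARTITIONS        # same key -> same partition
    partitions[p].append(value)
    offset = len(partitions[p]) - 1       # offsets are local to each partition
    return (topic, p, offset)

a = send("orders", "customer-1", b"order for customer-1")
b = send("orders", "customer-2", b"order for customer-2")
# Two messages can both sit at offset 0 in *different* partitions,
# so only the full triple (topic name, partition ID, offset) is unique.
print(a, b)
```

Note that the address of a message is known only after it is appended, mirroring how offsets are assigned by the broker, not chosen by the producer.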
In the previous picture, I assumed a cluster with only one machine and, consequently, one broker. However, when there are many brokers, another feature of the Kafka ecosystem needs to be introduced: replication. Kafka is able to store a topic’s partitions across a configurable number of servers (or brokers), so that each partition is replicated on other brokers. Hence, each broker holds several partitions, and each of those can be either a leader or a replica for a topic. For each partition, the broker entitled to manage it is the one storing its ‘leader’ version. However, if that broker’s machine fails, the processing of the partition continues on one of its replicas, which becomes the new leader.
Let’s visualize it to make it clearer:
Here we have three brokers and three partitions with a replication factor of 3 (three copies of each partition). As long as no broker crashes, broker 1 processes only the data stored in partition 0, since it holds the ‘leader’ version, while its replicas stay synchronized with it. Similarly, brokers 2 and 3 handle partitions 1 and 2 respectively. This way of processing data embodies horizontal scalability, that is, solving the same problem by throwing more machines at it.
Now imagine the machine executing broker 1 crashes. What happens is the following:
Now broker 2’s copy of partition 0 is the new leader, so broker 2 processes both partitions 0 and 1. This concept helps us explain fault tolerance: if a partition has n copies, it can still be processed even if n-1 faults occur.
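This leader failover can be simulated with plain Python dictionaries (a heavy simplification of what Zookeeper and the brokers actually do; the broker and partition labels are invented):

```python
# For each partition, the ordered list of brokers holding a copy;
# by convention here, the first *live* broker in the list is the leader.
replicas = {
    "partition-0": ["broker-1", "broker-2", "broker-3"],
    "partition-1": ["broker-2", "broker-3", "broker-1"],
    "partition-2": ["broker-3", "broker-1", "broker-2"],
}
live_brokers = {"broker-1", "broker-2", "broker-3"}

def leader(partition: str):
    """The leader is the first replica whose broker is still alive."""
    for broker in replicas[partition]:
        if broker in live_brokers:
            return broker
    return None  # all copies are down: more faults than the replicas can absorb

print(leader("partition-0"))       # broker-1
live_brokers.discard("broker-1")   # the machine running broker 1 crashes
print(leader("partition-0"))       # broker-2 takes over as the new leader
```

With a replication factor of 3, the `leader` lookup still succeeds after two broker failures, matching the n-1 fault-tolerance rule above.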
The final concept I want to cover is on the consumer side. Let’s imagine we have only one consumer and a huge amount of data to send to it: it might not be able to process all that data. We’d like our data to be processed in parallel, but how can we implement that? The answer is by creating a consumer group. It is a single receiving application composed of several consumers, each reading from one or more partitions (but the same partition cannot be read by two different consumers of the group: there are no duplicates). Hence, the maximum number of useful consumers is the number of partitions; beyond that, some consumers will be doing nothing. Note that with multiple consumers in a group, messages are read in order within a partition, but in parallel across partitions.
So a sample situation might be the following:
We have one topic split into three partitions, and we have a consumer group with three members, each reading one partition.
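Such an assignment can be sketched in Python (a simplification: real Kafka assignors are pluggable and more sophisticated, and the partition and consumer names here are invented):

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer, round-robin style.
    With more consumers than partitions, the extras receive nothing."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

parts = ["p0", "p1", "p2"]
print(assign(parts, ["c1", "c2", "c3"]))
# {'c1': ['p0'], 'c2': ['p1'], 'c3': ['p2']}  -- one partition each
print(assign(parts, ["c1", "c2", "c3", "c4"]))
# 'c4' ends up with no partitions: the extra consumer sits idle
```

The second call shows why the number of partitions is the upper limit on useful consumers in a group.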
But what happens if new consumers join the group, or current consumers leave? How does Kafka reassign partitions and rebalance the work across the new consumer configuration?
What Kafka needs is a group coordinator, so one of the brokers is elected to play that role. When a consumer wants to join or leave the group, it sends a request to the group coordinator. The first consumer to join a group becomes the group leader. When a rebalance is needed, the group leader is in charge of computing the new partition assignment (keeping in mind the upper limit on the number of active consumers), while the group coordinator manages the membership and distributes the new assignment to the consumers.
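When membership changes, the rebalance amounts to recomputing a partition assignment over the new set of members. A minimal sketch (the round-robin helper here is a stand-in for Kafka’s real, pluggable assignment strategies, and all names are invented):

```python
def assign(partitions, consumers):
    """Round-robin assignment, standing in for a real partition assignor."""
    plan = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        plan[consumers[i % len(consumers)]].append(p)
    return plan

partitions = ["p0", "p1", "p2"]
group = ["c1", "c2"]
print(assign(partitions, group))   # c1 -> [p0, p2], c2 -> [p1]

group.append("c3")                 # a consumer joins: a rebalance is triggered
print(assign(partitions, group))   # now one partition per consumer

group.remove("c1")                 # a consumer leaves: rebalance again
print(assign(partitions, group))   # c2 -> [p0, p2], c3 -> [p1]
```

In real Kafka the coordinator detects these membership changes through heartbeats rather than explicit list mutations, but the outcome is the same: every partition always ends up with exactly one owner in the group.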
Well, now that you have an idea of Kafka’s architecture, let’s run a very basic example.
After creating my virtual environment for Kafka, I created a new topic called “test”, which won’t be divided into multiple partitions (indeed, it will only have one) and won’t be replicated (its replication factor is equal to one):
(kafka) [root@quickstart kafka] kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
Now I’m creating my producer and writing three messages it will send to the consumer:
(kafka) [root@quickstart kafka] kafka-console-producer --broker-list localhost:9092 --topic test
>hello
>this is my first message
>this is my second message
Finally, let’s set our consumer, which will be able to receive the messages the producer wrote above:
(kafka) [root@quickstart kafka] kafka-console-consumer --bootstrap-server localhost:9092 --topic test --from-beginning
hello
this is my first message
this is my second message
Apache Kafka, when combined with Spark Streaming, allows you not only to keep track of streaming data, but also to build relevant analytics on top of it. This combination is particularly powerful if you need to monitor running engines and focus on latencies or sub-optimal executions, or if you want a picture of the market’s response to your new business strategy.
Whatever the business sector involved, real-time data analysis could add relevant value to it.