inspired by Java Champion Nikhil Nanivadekar
- What is Kafka?
- Broker (Kafka server)
1. What is Kafka?
There are many definitions, but in essence, Kafka is nothing but a transactional log.
It is similar to a database transaction log.
This means the source of truth, in terms of data, is the transaction log.
The difference with a database is that Kafka adds features that are important for our ecosystem.
Let's look at some of them:
- Kafka guarantees ordering within every partition of a topic (these terms are defined below)
- Kafka guarantees delivery even if up to (n-1) of the n servers hosting the data go down (we'll get back to this below)
Let's detail the Kafka components!
◼ Zookeeper:
👉Zookeeper is required in the Kafka ecosystem because it retains all the necessary configuration.
👉Zookeeper is also in charge of electing a leader in a Kafka cluster.
👉Zookeeper also ensures that data is replicated to all the necessary nodes
◼ Broker (Server):
👉It hosts the Kafka topics and the related data published to those topics
👉You generally have several brokers which are part of a cluster managed by Zookeeper
◼ Topic:
👉It is a logical separation of messages
👉A topic is further divided into partitions, from which the consumers pick up the data
👉The higher the number of partitions, the more opportunity consumers have to consume in parallel
◼ Producer:
👉Produces messages and publishes them to a particular topic (and partition)
◼ Consumer:
👉Consumes messages from a particular topic (and partition)
⚠ There is a paramount point about consumers:
Within a consumer group, one partition can be consumed by only one consumer; however, one consumer can consume from multiple partitions.
- Start of Zookeeper:
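The exact command from the talk is not reproduced here; assuming the standard Kafka distribution layout (scripts under `bin/`, configs under `config/`), the usual way to start it is:

```shell
# Start Zookeeper with the sample config shipped with Kafka
# (listens on localhost:2181 by default).
bin/zookeeper-server-start.sh config/zookeeper.properties
```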
- Start of the broker:
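Assuming the standard Kafka distribution layout, the broker is typically started with:

```shell
# Start a Kafka broker with the sample config; server.properties points it
# at the Zookeeper instance on localhost:2181.
bin/kafka-server-start.sh config/server.properties
```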
This broker connects to the Zookeeper instance created just before (logged in the Zookeeper logs).
Note that the amount of data a broker retains directly depends on the amount of file/disk space available on your system, because everything is stored in logs so it can be retrieved and delivered.
- Creation of a topic called 'test':
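A typical creation command, in the pre-2.2 style where `kafka-topics.sh` talked to Zookeeper (newer Kafka versions use `--bootstrap-server localhost:9092` instead), would be:

```shell
# Create the 'test' topic with a single partition and no replication.
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic test
```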
Here the replication factor is 1, but in an integrated environment you generally use a higher value, in case one of the servers goes down.
Kafka topics are nothing but logical separation.
To understand the value of this logical separation, think of a travel system offering tickets for bus/boat/plane/train.
If there were no separation, every consumer would have to process irrelevant data (say, the plane consumer would have to read the bus/boat/train messages it does not care about).
Separation is used for efficiency.
It is the producer's responsibility to publish messages to the appropriate topic.
That speeds up message consumption.
- Creation of the producer:
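Assuming the console producer from the Kafka distribution is what the demo used (`--broker-list` was the flag of that era; newer versions use `--bootstrap-server`):

```shell
# Start a console producer: each line typed on stdin becomes a message
# published to the 'test' topic.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
```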
- Creation of the consumer:
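Assuming the console consumer from the Kafka distribution; adding `--group` pins it to a named consumer group, which is what the consumer-group experiments further down rely on:

```shell
# Start a console consumer that reads the 'test' topic from the beginning
# as part of the 'consumer-group' group.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic test --from-beginning --group consumer-group
```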
So we start the producer while the consumer is not started yet.
The producer produces messages at its own pace, and the consumer will consume them at its own pace once it is started.
That way, the producer and the consumer are decoupled.
- We pause the producer after producing 9 messages; now we start the consumer, and it consumes the nine messages. Now both are running, and the consumer consumes the messages produced by the producer.
- We kill the consumer (simulating a server going down) and produce new messages.
- When we restart the consumer, it consumes only the newly produced messages.
This is thanks to two Kafka concepts: Consumer Groups and Offsets
◼ Consumer group: a consumer group gets the messages that have not already been consumed by that group.
Note that you can have one producer and multiple consumers consuming from the exact same topic; consumer groups allow you to decouple those consumers.
Kafka delivers the messages that have not yet been consumed by a particular consumer group.
So if I create a new consumer with a different group:
If we start this consumer, we will get all the messages from the beginning.
If the producer produces a new message, we see that both consumers (the one of consumer-group and the other of consumer-group-1) consume the message 17.
What if we stop the consumer of consumer-group, the producer sends new messages (18 19 20 21), and then we restart the consumer?
That consumer will consume only the missed messages (18 19 20 21).
How is that possible?
It is thanks to the following Kafka concept: offsets.
◼ Offsets: a unique identifier for every single message (say, an id for every message within a topic and partition).
It allows addressing a particular message in the transaction log.
Zookeeper uses offsets to determine the highest produced offset, and can thus work out which messages a consumer group is missing (from its last consumed offset up to the latest available offset).
Each message has an offset.
Zookeeper retains the highest produced offset for a topic and partition.
Zookeeper also retains the last consumed offset for a consumer group from a topic and partition.
When consumption starts, depending on the configuration, Kafka will deliver the messages between the last consumed offset and the latest available offset.
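The offset bookkeeping described above can be sketched in plain Java. This is a deliberately simplified model with hypothetical names (`OffsetSketch`, `produce`, `poll`), not Kafka's actual implementation: the log keeps every message at an increasing offset, and each consumer group only remembers the last offset it consumed.

```java
import java.util.*;

// Simplified sketch of offset-based delivery for a single topic partition.
public class OffsetSketch {
    static List<String> log = new ArrayList<>();             // the partition's transaction log; offset = index
    static Map<String, Integer> committed = new HashMap<>(); // last consumed offset per consumer group

    static void produce(String msg) {
        log.add(msg);                                        // the message's offset is its position in the log
    }

    // Deliver everything between the group's committed offset and the latest offset.
    static List<String> poll(String group) {
        int from = committed.getOrDefault(group, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        committed.put(group, log.size());                    // remember the new position for this group
        return batch;
    }

    public static void main(String[] args) {
        produce("m1"); produce("m2"); produce("m3");
        System.out.println(poll("consumer-group"));          // first poll: all three messages
        produce("m4"); produce("m5");                        // consumer-group is "down" meanwhile
        System.out.println(poll("consumer-group"));          // only the missed messages m4, m5
        System.out.println(poll("consumer-group-1"));        // a new group gets everything from the beginning
    }
}
```

This is exactly the behaviour observed in the demo: a restarted consumer gets only what it missed, while a consumer in a fresh group replays the topic from the start.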
Remember: one consumer can consume from multiple partitions. However, within a consumer group, there can be one and only one consumer consuming from a particular partition.
So if two consumers belong to the same group, only one of them will get a given message. Which one is decided by Kafka's rebalancing.
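The partition-to-consumer rule can be sketched as a round-robin assignment, in the spirit of (but much simpler than) Kafka's real rebalancing protocol; `AssignSketch` and `assign` are hypothetical names for illustration.

```java
import java.util.*;

// Simplified sketch: spread partitions round-robin over the consumers of one group,
// so each partition has exactly one owner, but one consumer may own several partitions.
public class AssignSketch {
    static Map<String, List<Integer>> assign(List<String> consumers, int partitions) {
        Map<String, List<Integer>> owned = new LinkedHashMap<>();
        for (String c : consumers) owned.put(c, new ArrayList<>());
        for (int p = 0; p < partitions; p++)
            owned.get(consumers.get(p % consumers.size())).add(p); // one owner per partition
        return owned;
    }

    public static void main(String[] args) {
        // 3 partitions, 2 consumers: one consumer owns two partitions, the other one.
        System.out.println(assign(List.of("consumer1", "consumer2"), 3));
        // Kill consumer1 and "rebalance": consumer2 takes over all partitions.
        System.out.println(assign(List.of("consumer2"), 3));
        // 4 consumers for 3 partitions: one consumer is left without any partition.
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3));
    }
}
```

Recomputing the assignment with a different consumer list is the sketch's analogue of a rebalance: it shows why killing a consumer hands its partitions to the survivors, and why extra consumers beyond the partition count sit idle.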
Now, let's switch to Java:
And another consumer: Consumer2, literally a copy-paste of Consumer1 (except for the class name).
So we have one producer and two consumers for a topic.
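The talk's actual classes are not reproduced here; the sketch below shows the typical shape of such a producer and consumer using the official `org.apache.kafka:kafka-clients` API (the class name `ProducerConsumerSketch` and the message values are placeholders). It needs that dependency and a running broker, so treat it as a sketch rather than a standalone program.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        // Producer side: publish a few messages to the 'test' topic.
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            for (int i = 1; i <= 9; i++) {
                producer.send(new ProducerRecord<>("test", "message " + i));
            }
        }

        // Consumer side: Consumer1 and Consumer2 would both run this loop with the
        // same group.id, so Kafka splits the topic's partitions between them.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "consumer-group");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Logging the partition and offset of each record is what lets you see, as in the demo, which consumer owns which partition.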
- Let's launch it with three partitions:
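The exact command is not shown in the notes; assuming the same `test` topic is reused, its partition count can be raised to three with (pre-2.2 Zookeeper-style flags):

```shell
# Grow the existing 'test' topic to three partitions
# (partition counts can be increased, never decreased).
bin/kafka-topics.sh --alter --zookeeper localhost:2181 --topic test --partitions 3
```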
We run the consumers and the producer as applications from IntelliJ, and we notice (in the logs) that the consumers consume the messages produced by the producer, each consumer handling its own partition(s).
If you kill a consumer, the other consumer takes over the partitions of the killed one through Kafka rebalancing.
If we have more consumers than partitions, since one partition is bound to only one consumer, some consumers can be left without any messages.
Remember that Kafka rebalancing is non-deterministic: a consumer that was receiving messages can, after a rebalance, no longer receive any.
But you can force a rebalance.