Apache Kafka - Zookeeper Tutorial

Topics covered in this article:
  1. Need of a Messaging System
  2. What is Kafka?
  3. Kafka Components
  4. Kafka Features
  5. Kafka Architecture (Topics and Partitions)
  6. Installing Kafka
  7. Kafka Cluster
  8. Working with a Single Node, Single Broker Cluster (CLI hands-on)

1. Need of a Messaging System

Data pipelines - In real-time scenarios, different systems need to communicate with each other, and this communication happens over data pipelines.

Chat Server -----------Data Pipeline---------------> Database server

Many services communicate with various internal systems such as the front end, back-end APIs, database servers, security systems, real-time monitoring, chat servers, etc.

A messaging system helps manage the complexity of these pipelines: Kafka decouples the data pipelines, so each system only talks to Kafka rather than to every other system.

Key entities involved in Kafka:
Producer (Publisher)
Consumer (Subscriber)

2. What is Kafka? (High-throughput distributed messaging system)

  • Apache Kafka is a distributed publish-subscribe messaging system
  • It was originally developed at LinkedIn and later became part of the Apache project
  • Kafka is fast, reliable, scalable, durable, fault-tolerant and distributed by design
    • Reliability - It is distributed, partitioned, replicated and fault tolerant
    • Scalability - The messaging system scales easily without downtime
    • Durability - It uses a 'distributed commit log', which means messages are persisted on disk as quickly as possible, hence it is durable
    • Performance - It has high throughput for both publishing and subscribing messages, and it maintains stable performance even when many TB of messages are stored

3. Kafka Components

  • Producer : Any application or system that publishes messages/events to a topic
  • Consumer : Any application or system that subscribes to a topic and consumes the messages/events
  • Broker : A Kafka cluster is a set of servers, each of which is called a broker. A broker is nothing but a server; in simple words, it is the intermediary that handles message exchange between producers and consumers
  • Cluster : There can be one or more brokers in a Kafka cluster
  • Topic : It specifies the category of messages, or the feed name, to which records are published, e.g. UPI payments, card payments, flight & movie bookings, mobile recharges, etc. Create different topics based on your requirements, as in the examples above. In generic terms, a topic in the Kafka ecosystem is comparable to a table in a database, with records maintained per table
  • Partition : Topics are broken up into ordered commit logs called partitions
  • Offset : In Kafka, a sequence number is assigned to each message within each partition of a topic. This sequence number is called the offset.
  • Consumer Groups : As the name suggests, a Kafka consumer group is a group of consumers that share the workload, much like dividing a large task among multiple individuals. There is no guarantee about which partitions get assigned to which consumers; the assignment depends on the configured assignment algorithm. Add as many consumers as needed under a specific group name to spread the load and improve throughput (see the sketch after this list).
  • Zookeeper : A prerequisite for Kafka. Kafka is a distributed system and uses ZooKeeper for coordination: tracking the status of Kafka cluster nodes, keeping track of topics, partitions, offsets, etc., and storing metadata about topics, brokers and the cluster
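
To see these terms in action, here is a minimal sketch using the console tools that ship with Kafka. The topic name "payments" and group name "payment-processors" are placeholders, and a broker is assumed to be running on localhost:9092:

  # Terminal 1: publish messages by typing lines at the > prompt
  bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic payments

  # Terminal 2: a consumer in group 'payment-processors' reads the messages
  bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic payments --group payment-processors --from-beginning

Starting a second consumer with the same --group value makes Kafka split the topic's partitions between the two consumers, which is exactly the workload sharing described above.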

Kafka Cluster:

[Diagram: multiple producers and consumers communicating through a Kafka cluster with multiple brokers, and ZooKeeper connected to the cluster for coordination and management of the brokers.]

4. Kafka Features

  • High Throughput - Supports hundreds of thousands of messages on modest infrastructure
  • Scalability - A highly scalable distributed system with no downtime
  • No Data Loss - Kafka ensures no data loss once configured properly
  • Stream Processing - Kafka can be used along with real-time streaming applications like Spark and Storm
  • Durability - Provides support for persisting messages on disk
  • Replication - Messages can be replicated across clusters, which supports multiple subscribers

5. Kafka Architecture (Topics and Partitions)

  • A topic is a category or feed name to which records are published
  • Topics are broken up into ordered commit logs called partitions
  • Each message in a partition is assigned a sequential ID called an offset
  • Data in a topic is retained for a configurable period of time
  • Writes to a partition are generally sequential, thereby reducing the number of hard disk seeks
  • Messages can be read from the beginning, and a consumer can also rewind or skip to any point in a partition by giving an offset value (see the example after the diagram below)
Anatomy of a Topic

Partition 0:   0 | 1 | 2 | 3 | 4 | 5       <-- writes
Partition 1:   0 | 1 | 2 | 3 | 4 | 5       <-- writes
Partition 2:   0 | 1 | 2 | 3 | 4 | 5 | 6   <-- writes

Old --------------------------------> New
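
For example, to rewind and re-read partition 0 of a topic starting at offset 3, the console consumer accepts an explicit partition and offset (the topic name "payments" is a placeholder):

  bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic payments --partition 0 --offset 3

Note that --offset requires --partition, because offsets are only meaningful within a single partition.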

Kafka Components - Topic, Partitions & Replicas
e.g.
  • Topic configured to use 4 partitions
  • Each partition has an ID
  • The ID of a replica is the same as the ID of the broker that hosts it, i.e. [repl n], where n is the broker it resides on
  • For each partition, Kafka will elect one replica as the 'leader' of the partition
  • If, say, the replication factor of a topic is set to 3, then Kafka will create 3 identical replicas of each partition and place them on available brokers in the cluster (see the sketch below)
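
As a sketch, assuming a cluster with at least 3 brokers and a hypothetical topic named "payments":

  # create a topic with 4 partitions, each replicated to 3 brokers
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic payments --partitions 4 --replication-factor 3

  # show, per partition, the leader replica, the replica list and the in-sync replicas (ISR)
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic payments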
Kafka Components - Messages
  • A unit of data in Kafka is a message
  • Consider a message as analogous to a record in a database
  • To control which partition a message is written to, a key is used
  • Messages with the same key are always written to the same partition (see the keyed-producer sketch below)
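
A minimal sketch of producing keyed messages with the console producer; the key.separator character and the topic name "payments" are arbitrary choices:

  bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic payments --property parse.key=true --property key.separator=:
  > user42:payment-initiated
  > user42:payment-completed

Both messages carry the key "user42", so they land in the same partition and will be read back in the order they were written.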

Kafka Components - Consumer
  • Consumers (subscribers or readers) read messages
  • A consumer subscribes to one or more topics and reads the messages sequentially
  • The consumer keeps track of which messages it has already consumed by tracking the offsets of those messages
  • The offset is a bit of metadata (an integer value that continually increases) that Kafka assigns to each message as it is produced
  • Within a partition, each message has a unique offset
  • With the offset of the last consumed message stored, a consumer can stop and restart without losing its current position (see the inspection command below)
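
To inspect the offsets a consumer group has committed, Kafka ships a kafka-consumer-groups.sh tool; the group name here is a placeholder:

  bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group payment-processors

Per partition, the output lists the committed offset, the log-end offset and the lag (how far the group is behind the latest message).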
Kafka Components - Zookeeper
  • ZooKeeper is used for managing and coordinating the Kafka brokers, mainly for coordination between the brokers in the cluster
  • The Kafka cluster is connected to ZooKeeper so that it is informed about any failed nodes (see the command below)
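
To peek at the metadata ZooKeeper holds for a ZooKeeper-based cluster, Kafka bundles a zookeeper-shell.sh utility (ZooKeeper is assumed to be on its default port 2181):

  bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids      # IDs of the registered brokers
  bin/zookeeper-shell.sh localhost:2181 ls /brokers/topics   # topics known to the cluster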

Kafka - Use Cases
  • Messaging
    • Applications can produce messages using Kafka, without being concerned about the format of the messages
    • Messages can then be sent to and handled by a single application that reads all of them consistently, including:
      • Formatting messages with a common look
      • Sending multiple messages in a single notification
      • Receiving messages in a way that meets the user's preferences
  • Activity Tracking
    • Kafka was originally designed at LinkedIn to track user activity
    • Users interact with front-end applications, which generate messages regarding the actions the users are taking
    • Kafka can track anything from simple information, such as clicks, to complex information, such as data in a user's profile
  • Metrics and Logging
    • Kafka is also ideal for collecting application and system metrics and logs
    • Applications publish metrics on a regular basis to a Kafka topic, and those metrics can be consumed by systems for monitoring and alerting
    • Log messages can be published in the same way and routed to dedicated log search systems like Elasticsearch or security analysis applications
  • Commit log
    • Database changes can be published to Kafka, and applications can easily monitor this stream to receive live updates as they happen
    • Kafka can replicate database updates to a remote system, consolidating changes from multiple applications into a single database view
    • Durable retention is useful here, providing a buffer for the change log, meaning the log can be replayed in the event of a failure of the consuming applications
    • Log-compacted topics can provide longer retention by retaining only a single (the latest) change per key
  • Stream Processing
    • The term 'stream processing' typically refers to applications that provide functionality similar to map/reduce processing in Hadoop
    • Stream processing operates on data in real-time, as quickly as messages are produced:
      • Write small applications to operate on Kafka messages
      • Performing tasks such as counting metrics
      • Partitioning messages for efficient processing by other applications

6. Installation of Kafka

Prerequisite - Java
Components: Apache Zookeeper, Apache Kafka

* You can install ZooKeeper separately, or rely on the ZooKeeper that ships with the Kafka download, depending on the installation type.

Use the official website to download open-source Apache Kafka (latest version at the time of writing: 3.4.x):
https://kafka.apache.org/downloads

Go and download the latest stable version of Kafka (a command-line sketch follows the list below).

Three different types of Kafka installation:
  • Open Source - Apache Kafka (https://kafka.apache.org/downloads)
  • Commercial distribution - Confluent Kafka (https://confluent.io) - (Offset Explorer: https://kafkatool.com)
  • Managed Kafka service - e.g. Confluent Cloud and AWS
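
A minimal sketch of downloading and extracting the open-source distribution on Linux. The exact version and Scala build in the file name will vary; 3.4.0 built for Scala 2.13 is used here as an example, and older releases move to archive.apache.org/dist/kafka/:

  wget https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz
  tar -xzf kafka_2.13-3.4.0.tgz
  cd kafka_2.13-3.4.0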

7. Kafka Cluster

  • Kafka brokers are designed to operate as part of a cluster
  • One broker also functions as the cluster controller
  • The controller is responsible for administrative operations such as:
    • Assigning partitions to brokers
    • Monitoring for broker failures in a cluster
  • A particular partition is owned by a broker, and that broker is called the leader of the partition
  • All consumers and producers operating on that partition must connect to the leader

Types of Kafka Clusters

  • Single Node-Single Broker Cluster
  • Single Node-Multiple Broker Cluster
  • Multiple Nodes-Multiple Broker Cluster

8. Kafka Command Line Interface - Hands-on (Single Node, Single Broker)

  • Producer -> Consumer Flow
    1. Start Zookeeper
    2. Start the Kafka server
    3. Create a topic (partition count, replication factor)
    4. Produce and consume messages (sketched at the end of this section)
  • Extract the downloaded archive and navigate to its "bin" folder. Based on your OS (Unix or Windows), run the .sh scripts (under bin/) or the .bat scripts (under bin\windows\); the Unix variants are shown below
  • Open first command line window or terminal
    1. bin/zookeeper-server-start.sh config/zookeeper.properties
    2. Running this starts the ZooKeeper service - default port 2181
  • Open second terminal
    1. bin/kafka-server-start.sh config/server.properties
    2. Running this starts the Kafka server / broker - default port 9092
  • Open third terminal
    1. bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic mytopicname --partitions 3 --replication-factor 1
    2. Running this creates a new topic named "mytopicname"
    3. You can create topics as per your requirements
    4. To list the topics that have been created, run:
    5. bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
    6. bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic mytopicname
    7. The describe command prints the details of the topic: partitions, replication factor, leader information, etc.
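
To complete the producer -> consumer flow from step 4, here is a minimal sketch using the console clients that ship with Kafka, reusing the "mytopicname" topic created above:

  • Open a fourth terminal (producer)
    1. bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic mytopicname
    2. Each line typed at the > prompt is published as a message to the topic
  • Open a fifth terminal (consumer)
    1. bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopicname --from-beginning
    2. Messages published by the producer appear here; --from-beginning replays the topic from the earliest offset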
