Apache Kafka - Zookeeper Tutorial

Topics covered in this article:
  1. Need of a Messaging System
  2. What is Kafka?
  3. Kafka Components
  4. Kafka Features
  5. Kafka Architecture (Topics and Partitions)
  6. Installing Kafka
  7. Kafka Cluster
  8. Working with a Single Node, Single Broker Cluster (CLI hands-on)

1. Need of a Messaging System

Data pipelines - In real-time scenarios, different systems need to communicate with each other, and this communication happens over data pipelines.

Chat Server -----------Data Pipeline---------------> Database server

Many services communicate with various internal systems such as the front end, back-end APIs, database servers, security systems, real-time monitoring, chat servers, etc.

A messaging system helps manage the complexity of these pipelines: Kafka decouples the data pipelines, so each system only talks to Kafka rather than to every other system.

Key entities involved in Kafka:
Producer (Publisher)
Consumer (Subscriber)

2. What is Kafka? (High-throughput distributed messaging system)

  • Apache Kafka is a distributed publish-subscribe messaging system
  • It was originally developed at LinkedIn and later became part of the Apache project
  • Kafka is fast, reliable, scalable, durable, fault-tolerant and distributed by design
    • Reliability - It is distributed, partitioned, replicated and fault tolerant
    • Scalability - The messaging system scales easily without downtime
    • Durability - It uses a 'distributed commit log', which means messages are persisted on disk as quickly as possible, hence it is durable
    • Performance - It has high throughput for both publishing and subscribing messages, and it maintains stable performance even when many TB of messages are stored

3. Kafka Components

  • Producer : Any application or system that publishes messages/events to a topic
  • Consumer : Any application or system that subscribes to a topic and consumes the messages/events
  • Broker : A Kafka cluster is a set of servers, each of which is called a broker. A broker is nothing but a server; in simple words, it is the intermediary that handles message exchange between producers and consumers
  • Cluster : There can be one or more brokers in a Kafka cluster
  • Topic : It specifies the category of messages, or the feed name, to which records are published, e.g. UPI payments, card payments, flight & movie bookings, mobile recharges, etc. Create different topics based on your requirements, as in the examples above. In generic terms, a topic in the Kafka ecosystem is comparable to a table in a database, with records maintained per table
  • Partition : Topics are broken up into ordered commit logs called partitions
  • Offset : In Kafka, a sequence number is assigned to each message within each partition of a topic. This sequence number is called the offset.
  • Consumer Groups : As the name suggests, a Kafka consumer group is a group of consumers that share the workload, much like dividing a large task among multiple individuals. There is no guarantee about which partitions get assigned to which consumers; the assignment depends on the configured assignment algorithm. Add as many consumers as needed under a specific group name to spread the load and improve throughput (see the sketch after this list).
  • Zookeeper : A prerequisite for Kafka. Kafka is a distributed system and uses ZooKeeper for coordination: tracking the status of Kafka cluster nodes, keeping track of topics, partitions, offsets, etc., and storing metadata about topics, brokers and the cluster
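
To see these terms in action, here is a minimal sketch using the console tools that ship with Kafka. The topic name "payments" and group name "payment-processors" are placeholders, and a broker is assumed to be running on localhost:9092:

  # Terminal 1: publish messages by typing lines at the > prompt
  bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic payments

  # Terminal 2: a consumer in group 'payment-processors' reads the messages
  bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic payments --group payment-processors --from-beginning

Starting a second consumer with the same --group value makes Kafka split the topic's partitions between the two consumers, which is exactly the workload sharing described above.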

Kafka Cluster:

[Diagram: multiple producers and consumers communicating through a Kafka cluster with multiple brokers, and ZooKeeper connected to the cluster for coordination and management of the brokers.]

4. Kafka Features

  • High Throughput - Supports hundreds of thousands of messages on modest infrastructure
  • Scalability - A highly scalable distributed system with no downtime
  • No Data Loss - Kafka ensures no data loss once configured properly
  • Stream Processing - Kafka can be used along with real-time streaming applications like Spark and Storm
  • Durability - Provides support for persisting messages on disk
  • Replication - Messages can be replicated across clusters, which supports multiple subscribers

5. Kafka Architecture (Topics and Partitions)

  • A topic is a category or feed name to which records are published
  • Topics are broken up into ordered commit logs called partitions
  • Each message in a partition is assigned a sequential ID called an offset
  • Data in a topic is retained for a configurable period of time
  • Writes to a partition are generally sequential, thereby reducing the number of hard disk seeks
  • Messages can be read from the beginning, and a consumer can also rewind or skip to any point in a partition by giving an offset value (see the example after the diagram below)
Anatomy of a Topic

Partition 0:   0 | 1 | 2 | 3 | 4 | 5       <-- writes
Partition 1:   0 | 1 | 2 | 3 | 4 | 5       <-- writes
Partition 2:   0 | 1 | 2 | 3 | 4 | 5 | 6   <-- writes

Old --------------------------------> New
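
For example, to rewind and re-read partition 0 of a topic starting at offset 3, the console consumer accepts an explicit partition and offset (the topic name "payments" is a placeholder):

  bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic payments --partition 0 --offset 3

Note that --offset requires --partition, because offsets are only meaningful within a single partition.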

Kafka Components - Topic, Partitions & Replicas
e.g.
  • Topic configured to use 4 partitions
  • Each partition has an ID
  • The ID of a replica is the same as the ID of the broker that hosts it, i.e. [repl n], where n is the broker it resides on
  • For each partition, Kafka will elect one replica as the 'leader' of the partition
  • If, say, the replication factor of a topic is set to 3, then Kafka will create 3 identical replicas of each partition and place them on available brokers in the cluster (see the sketch below)
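
As a sketch, assuming a cluster with at least 3 brokers and a hypothetical topic named "payments":

  # create a topic with 4 partitions, each replicated to 3 brokers
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic payments --partitions 4 --replication-factor 3

  # show, per partition, the leader replica, the replica list and the in-sync replicas (ISR)
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic payments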
Kafka Components - Messages
  • A unit of data in Kafka is a message
  • Consider a message as analogous to a record in a database
  • To control which partition a message is written to, a key is used
  • Messages with the same key are always written to the same partition (see the keyed-producer sketch below)
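
A minimal sketch of producing keyed messages with the console producer; the key.separator character and the topic name "payments" are arbitrary choices:

  bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic payments --property parse.key=true --property key.separator=:
  > user42:payment-initiated
  > user42:payment-completed

Both messages carry the key "user42", so they land in the same partition and will be read back in the order they were written.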

Kafka Components - Consumer
  • Consumers (subscribers or readers) read messages
  • A consumer subscribes to one or more topics and reads the messages sequentially
  • The consumer keeps track of which messages it has already consumed by tracking the offsets of those messages
  • The offset is a bit of metadata (an integer value that continually increases) that Kafka assigns to each message as it is produced
  • Within a partition, each message has a unique offset
  • With the offset of the last consumed message stored, a consumer can stop and restart without losing its current position (see the inspection command below)
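
To inspect the offsets a consumer group has committed, Kafka ships a kafka-consumer-groups.sh tool; the group name here is a placeholder:

  bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group payment-processors

Per partition, the output lists the committed offset, the log-end offset and the lag (how far the group is behind the latest message).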
Kafka Components - Zookeeper
  • ZooKeeper is used for managing and coordinating the Kafka brokers, mainly for coordination between the brokers in the cluster
  • The Kafka cluster is connected to ZooKeeper so that it is informed about any failed nodes (see the command below)
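
To peek at the metadata ZooKeeper holds for a ZooKeeper-based cluster, Kafka bundles a zookeeper-shell.sh utility (ZooKeeper is assumed to be on its default port 2181):

  bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids      # IDs of the registered brokers
  bin/zookeeper-shell.sh localhost:2181 ls /brokers/topics   # topics known to the cluster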

Kafka - Use Cases
  • Messaging
    • Applications can produce messages using Kafka, without being concerned about the format of the messages
    • Messages can then be sent to and handled by a single application that reads all of them consistently, including:
      • Formatting messages with a common look
      • Sending multiple messages in a single notification
      • Receiving messages in a way that meets the user's preferences
  • Activity Tracking
    • Kafka was originally designed at LinkedIn to track user activity
    • Users interact with front-end applications, which generate messages regarding the actions the users are taking
    • Kafka can track anything from simple information, such as clicks, to complex information, such as data in a user's profile
  • Metrics and Logging
    • Kafka is also ideal for collecting application and system metrics and logs
    • Applications publish metrics on a regular basis to a Kafka topic, and those metrics can be consumed by systems for monitoring and alerting
    • Log messages can be published in the same way and routed to dedicated log search systems like Elasticsearch or security analysis applications
  • Commit log
    • Database changes can be published to Kafka, and applications can easily monitor this stream to receive live updates as they happen
    • Kafka can replicate database updates to a remote system, consolidating changes from multiple applications into a single database view
    • Durable retention is useful here, providing a buffer for the change log, meaning the log can be replayed in the event of a failure of the consuming applications
    • Log-compacted topics can provide longer retention by retaining only a single (the latest) change per key
  • Stream Processing
    • The term 'stream processing' typically refers to applications that provide functionality similar to map/reduce processing in Hadoop
    • Stream processing operates on data in real-time, as quickly as messages are produced:
      • Write small applications to operate on Kafka messages
      • Performing tasks such as counting metrics
      • Partitioning messages for efficient processing by other applications

6. Installation of Kafka

Prerequisite - Java
Components: Apache Zookeeper, Apache Kafka

* You can install ZooKeeper separately, or rely on the ZooKeeper that ships with the Kafka download, depending on the installation type.

Use the official website to download open-source Apache Kafka (latest version at the time of writing: 3.4.x):
https://kafka.apache.org/downloads

Go and download the latest stable version of Kafka (a command-line sketch follows the list below).

Three different types of Kafka installation:
  • Open Source - Apache Kafka (https://kafka.apache.org/downloads)
  • Commercial distribution - Confluent Kafka (https://confluent.io) - (Offset Explorer: https://kafkatool.com)
  • Managed Kafka service - e.g. Confluent Cloud and AWS
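
A minimal sketch of downloading and extracting the open-source distribution on Linux. The exact version and Scala build in the file name will vary; 3.4.0 built for Scala 2.13 is used here as an example, and older releases move to archive.apache.org/dist/kafka/:

  wget https://downloads.apache.org/kafka/3.4.0/kafka_2.13-3.4.0.tgz
  tar -xzf kafka_2.13-3.4.0.tgz
  cd kafka_2.13-3.4.0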

7. Kafka Cluster

  • Kafka brokers are designed to operate as part of a cluster
  • One broker also functions as the cluster controller
  • The controller is responsible for administrative operations such as:
    • Assigning partitions to brokers
    • Monitoring for broker failures in a cluster
  • A particular partition is owned by a broker, and that broker is called the leader of the partition
  • All consumers and producers operating on that partition must connect to the leader

Types of Kafka Clusters

  • Single Node-Single Broker Cluster
  • Single Node-Multiple Broker Cluster
  • Multiple Nodes-Multiple Broker Cluster

8. Kafka Command Line Interface - Hands-on (Single Node, Single Broker)

  • Producer -> Consumer Flow
    1. Start Zookeeper
    2. Start the Kafka server
    3. Create a topic (partition count, replication factor)
    4. Produce and consume messages (sketched at the end of this section)
  • Extract the downloaded archive and navigate to its "bin" folder. Based on your OS (Unix or Windows), run the .sh scripts (under bin/) or the .bat scripts (under bin\windows\); the Unix variants are shown below
  • Open first command line window or terminal
    1. bin/zookeeper-server-start.sh config/zookeeper.properties
    2. Running this starts the ZooKeeper service - default port 2181
  • Open second terminal
    1. bin/kafka-server-start.sh config/server.properties
    2. Running this starts the Kafka server / broker - default port 9092
  • Open third terminal
    1. bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic mytopicname --partitions 3 --replication-factor 1
    2. Running this creates a new topic named "mytopicname"
    3. You can create topics as per your requirements
    4. To list the topics that have been created, run:
    5. bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
    6. bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic mytopicname
    7. The describe command prints the details of the topic: partitions, replication factor, leader information, etc.
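
To complete the producer -> consumer flow from step 4, here is a minimal sketch using the console clients that ship with Kafka, reusing the "mytopicname" topic created above:

  • Open a fourth terminal (producer)
    1. bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic mytopicname
    2. Each line typed at the > prompt is published as a message to the topic
  • Open a fifth terminal (consumer)
    1. bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytopicname --from-beginning
    2. Messages published by the producer appear here; --from-beginning replays the topic from the earliest offset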
