Apache Kafka: The Secret Sauce to Real-Time Data Streaming



WHAT IS KAFKA?

Apache Kafka is a free, open-source, distributed event streaming platform optimized for processing streaming data in real time. It is used by hundreds of businesses for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.


Kafka can handle millions of messages per second and is horizontally scalable and fault-tolerant. It can be used to transfer data from one system to another, for example, from a database to a real-time analytics engine or from a sensor to a cloud storage system.


It is a publish-subscribe messaging system that acts as a middleman between two parties: the sender (producer) and the recipient (consumer).



WHY WAS KAFKA DEVELOPED?

Kafka was developed at LinkedIn and open-sourced in 2011 to address the company's requirement for a scalable and reliable solution for processing real-time data.


LinkedIn required a system capable of handling the massive amounts of data created by its members, such as activity updates, communications, and profile modifications. 


Kafka also had to be able to manage the traffic surges that LinkedIn encountered at various times of the day.


Many other firms, like Netflix, Uber, Twitter, and Airbnb, have subsequently embraced Kafka. These businesses utilize Kafka for a variety of objectives, including real-time data streaming to their websites and applications, developing data pipelines, and powering machine learning algorithms.



KAFKA'S LIFE STORY

Kafka was originally made available in January 2011.


Because of its scalability, dependability, and performance, Kafka quickly gained traction. Kafka graduated to a top-level Apache project in 2012.


Kafka is now one of the world's most popular event streaming platforms. It is utilized by businesses of all sizes, ranging from startups to Fortune 500 corporations.


It is maintained by the Apache Software Foundation, with Confluent, the company founded by Kafka's original creators, as a major contributor.



THE SIGNIFICANCE OF KAFKA

Kafka is significant because it allows enterprises to create real-time data pipelines and streaming applications. This enables organizations to respond to events as they occur and make decisions based on the most up-to-date data.


Kafka, for example, can be used to create a real-time fraud detection system that can detect and block fraudulent transactions as they happen. 


Kafka may also be used to create a real-time recommendation system that suggests products to customers based on their previous purchases and browsing history.


Kafka is also important because it is a scalable and reliable platform. It can manage massive amounts of data and is tolerant of failures. As a result, it is well suited for mission-critical applications.



KAFKA ARCHITECTURE


[Image: Kafka Architecture]


Broker Nodes:

Broker nodes are in charge of the majority of I/O operations and long-term persistence within the cluster.

In general, a broker is a person who arranges transactions between buyers and sellers in exchange for a commission when the transaction is completed.

Similarly, a Kafka broker acts as a go-between for the producer and the consumer. The actual data is stored in the Kafka broker. It is where topics are kept.

A Kafka cluster can consist of one or more running servers (brokers). A broker receives messages from producers and stores them on disk, keyed by a unique offset within each partition.

Brokers also let consumers retrieve messages from topics, write new events to partitions, and replicate partitions among themselves.

Zookeeper node:

ZooKeeper is free and open-source software. Kafka has traditionally used ZooKeeper to manage all brokers. Some of ZooKeeper's responsibilities include coordinating brokers, discovering new or deleted brokers, detecting newly added or altered topics, and so on.

Role of Kafka-Zookeeper

Controller Election - The controller is the Kafka broker in charge of electing partition leaders and maintaining the leader-follower relationship across all partitions. The cluster has only one active controller at a time.

Topic Configuration - The configuration of all topics, including the list of existing topics, the number of partitions for each topic, and the location of all replicas.

Cluster Membership - ZooKeeper also keeps a list of all the brokers that are active and part of the cluster at any one time.

Access Control Lists (ACLs) - ZooKeeper also maintains the ACLs for all topics.

Topic:

A stream of messages belonging to a particular category is called a topic. It can be considered like a folder in a file system.

It is a common location where all the related messages are produced by a producer or read by consumers.

Topics are divided into partitions, and partitions are assigned to brokers. Topics thus enforce a sharding of data on the broker level.

When setting up Kafka for the first time, we should take care both to allocate enough partitions per topic and to divide the partitions fairly among the brokers.
Replication provides high availability by optionally persisting each partition on multiple brokers.

In a replicated partition, Kafka will write messages to only one replica—the partition leader. The other replicas are followers, which fetch copies of the messages from the leader. Consumers may read from either the partition leader or from a follower as of Kafka version 2.4.

Additionally, any follower is eligible to serve as the partition leader if the current leader goes offline, provided the follower is recognized as an in-sync replica (ISR). Kafka considers a follower to be in sync if it has successfully fetched and acknowledged each message sent to the partition leader.
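As a minimal sketch of how partitions and replication are declared, the snippet below creates a topic with the Java AdminClient. The broker address, topic name, partition count, and replication factor are all assumptions chosen for illustration.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateOrdersTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Assumed broker address; replace with your own cluster.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions spread the topic across brokers; a replication
                // factor of 3 keeps a leader plus two follower replicas.
                NewTopic orders = new NewTopic("orders", 6, (short) 3);
                admin.createTopics(Collections.singleton(orders)).all().get();
            }
        }
    }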

Producers:

Producers publish messages to a topic. A producer is simply a program that sends data or messages to the Kafka cluster.

Whatever data we submit is simply an array of bytes to Kafka. Assume we have an orders table in our database, where each entry corresponds to a message, and we use the Kafka producer API to transmit the messages to the Kafka Orders topic.

Let's look at the basic properties that need to be configured for the Kafka producer API:

client.id - Identifies the producer application that is supplying the data.
bootstrap.servers - A list of broker addresses for connecting to the Kafka cluster.
key.serializer - A serializer that converts keys to bytes.
value.serializer - A serializer that converts values to bytes.
batch.size - When several records are transmitted to the same partition, the producer tries to consolidate them into batches to increase throughput; this sets the batch size in bytes.
buffer.memory - The total bytes of memory available to the producer for buffering records waiting to be delivered to the server.
retries - The number of times a failed producer request is retried.
Now, in order to deliver data to topics, we call the send() method on the producer and pass it a ProducerRecord.

A ProducerRecord is a key/value pair transmitted to the Kafka cluster. It includes the topic name to which the record is being delivered, an optional partition number, an optional key, and a value.

The record also includes a timestamp.
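Putting these pieces together, here is a minimal producer sketch. The broker address, client id, topic name, key, and value are assumptions chosen for illustration.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class OrderProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ProducerConfig.CLIENT_ID_CONFIG, "order-producer");          // assumed client id
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.RETRIES_CONFIG, 3);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The key decides the partition; the value is the payload.
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("orders", "order-1001", "{\"amount\": 200}");
                // send() is asynchronous; the callback reports where the record landed.
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    } else {
                        System.out.printf("Stored at partition %d, offset %d%n",
                                metadata.partition(), metadata.offset());
                    }
                });
                producer.flush();
            }
        }
    }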

Consumer

A consumer consumes the message from the topic.


[Image: Kafka Producer and Consumer]



Subscribers who read records from one or more topics and one or more partitions of a topic are known as Kafka consumers.

Consumers in Kafka are members of the Consumer group. When many consumers subscribe to a topic and are members of the same consumer group, each consumer in the group receives messages from a distinct subset of the topic partitions.

What is rebalancing in a Kafka consumer group? A rebalance is the transfer of partition ownership from one consumer to another. Rebalances are crucial because they give the consumer group high availability and scalability, and Kafka handles them automatically. A small listener sketch that observes these events follows the list below.

Rebalance happens when :

  • A consumer joins the group.
  • A consumer leaves the group.
  • A consumer is considered dead. This may happen after a crash or when the consumer is busy with long-running processing.
  • New partitions are added.
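A minimal sketch of such a listener is shown below; it simply logs partition movement, and a real application could use the same callbacks to commit offsets or flush state before partitions are taken away. The class name is hypothetical.

    import java.util.Collection;

    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.common.TopicPartition;

    // Passed to consumer.subscribe(topics, listener); logs partition movement.
    public class LoggingRebalanceListener implements ConsumerRebalanceListener {

        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
            // Called before a rebalance takes partitions away from this consumer.
            System.out.println("Partitions revoked: " + partitions);
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
            // Called after a rebalance assigns partitions to this consumer.
            System.out.println("Partitions assigned: " + partitions);
        }
    }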
The first step to start consuming records is to create a KafkaConsumer instance. 

There are four main properties that need to be set up first: bootstrap.servers, key.deserializer, value.deserializer, and group.id.

bootstrap.servers - The Kafka cluster connection string.
key.deserializer - Similar to the serializer in the producer, except this time the conversion is from a byte array to a Java object for the key.
value.deserializer - Like the key deserializer, converts a byte array to a Java object for the value.
group.id - The name of the consumer group to which a Kafka consumer belongs.

There are some other properties also available: fetch.min.bytes, fetch.max.wait.ms, max.partition.fetch.bytes, session.timeout.ms, and enable.auto.commit.

Once we create a consumer, the next step is to subscribe to one or more topics. The subscribe() method takes a list of topics as a parameter.
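A minimal consumer sketch follows, assuming the same local broker and the hypothetical orders topic and order-processors group used earlier; the rebalance listener from the previous section could be passed as a second argument to subscribe().

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class OrderConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");        // assumed group
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Join the group and start receiving partitions for the topic.
                consumer.subscribe(Collections.singletonList("orders"));

                while (true) {
                    // poll() fetches the next batch of records (or returns empty after the timeout).
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("key=%s value=%s partition=%d offset=%d%n",
                                record.key(), record.value(), record.partition(), record.offset());
                    }
                }
            }
        }
    }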

Message:

A Kafka message is a data unit sent between Kafka producers and consumers. Kafka messages are saved in topics, which are logical data channels.

Kafka messages consist of the following components:

Key: An optional identifier for the message, typically used to decide which partition the message is written to.
Value: The actual data of the message.
Timestamp: The time when the message was created.
Headers: Optional message metadata.

Messages in Kafka are usually small, ranging from a few bytes to a few kilobytes. As a result, they are well suited to streaming applications in which enormous amounts of data must be handled quickly. A message might look like this:
    {
      "key": "Alice",
      "value": "Made a payment of $200 to Bob",
      "timestamp": "Jun 25, 2020 at 2:06 p.m.",
      "headers": {
        "transaction_id": "1234567890"
      }
    }

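In the Java client, the same message could be built as a ProducerRecord; the topic name and the epoch-millisecond timestamp below are assumptions that mirror the JSON above.

    import java.nio.charset.StandardCharsets;

    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PaymentMessage {
        public static ProducerRecord<String, String> build() {
            ProducerRecord<String, String> payment = new ProducerRecord<>(
                    "payments",                        // hypothetical topic name
                    null,                              // partition: let Kafka derive it from the key
                    1593093960000L,                    // Jun 25, 2020 2:06 p.m. UTC as epoch millis
                    "Alice",                           // key
                    "Made a payment of $200 to Bob");  // value
            // Headers carry optional metadata as raw bytes.
            payment.headers().add("transaction_id",
                    "1234567890".getBytes(StandardCharsets.UTF_8));
            return payment;
        }
    }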
THE CORE APIs OF KAFKA

The Kafka core API is a collection of APIs that offer the fundamental functionality to produce and consume messages from Kafka topics. The following APIs comprise the core API:

The producer API allows programs to publish record streams to one or more Kafka topics.

The consumer API enables programs to subscribe to one or more topics and process the stream of records provided to them.

The admin client API enables programs to view and control Kafka objects including topics, brokers, and consumers.

The Connect API enables applications to create connectors for importing and exporting data between Kafka and other systems.

The Kafka Streams API provides a higher-level abstraction for developing stream processing applications.

These APIs are all accessible from Java, and client libraries are available for a variety of additional programming languages.
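As an illustration of the Streams API, here is a minimal sketch of an application that reads one topic, filters it, and writes the result to another; the application id, broker address, and topic names are assumptions.

    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class PaymentsFilterApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-filter-app"); // assumed id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Read each payment event, keep only values that mention a dollar amount,
            // and write the surviving events to a second topic.
            KStream<String, String> payments = builder.stream("payments");
            payments.filter((key, value) -> value != null && value.contains("$"))
                    .to("filtered-payments");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }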

CONCLUSION

Kafka is a versatile and powerful event streaming technology that may be used for a number of tasks. It's scalable, dependable, and can handle millions of messages per second.


[Image: Kafka real-time processing design]


Kafka is a wonderful solution to explore if you want to construct real-time data pipelines and streaming applications.

Here are some more advantages of adopting Kafka:

• Because Kafka is open source, it is free to use and modify.
• Kafka has a huge and active community, so there is plenty of help available.
• Kafka is well-documented and straightforward to use.
• Kafka integrates with several different tools and technologies.

Kafka is an excellent choice if you need a scalable, reliable, and easy-to-use event streaming platform.

If you are interested, dive into event-driven architecture with Spring Cloud Stream & Kafka Streams and learn how to build streaming applications with this comprehensive guide: learnjavaskills.in
     
