1. Introduction
Apache Kafka is a powerful tool for big data processing, making it possible to handle massive amounts of information with remarkably low latency and high throughput. It can serve as a message broker that receives input from multiple sources and quickly makes that input available to target systems, all in real time. This guide explores the architecture that underpins Apache Kafka's impressive power.
2. What is Apache Kafka?
Apache Kafka is a remarkable tool for advanced data processing, allowing you to quickly handle tremendous volumes of information. It acts as an intermediary that gathers input from multiple sources and seamlessly delivers those inputs to the desired systems — in real time! With its low latency and high throughput capabilities, this robust solution provides scalability and flexibility.
The Kafka broker stores messages in an immutable log, organized into topics. Its core architectural concept is the commit log, familiar from file systems and databases: every message that is sent is recorded so it can be replayed at any time, which helps maintain a consistent system state. Kafka stores data in a serialized, fault-tolerant way, and because it distributes the information across different nodes, you can scale up performance while reducing the impact of node failures.
2.1. What can Apache Kafka do for you?
Kafka is known for its versatility, flexibility and performance, yet there are certain applications in which it particularly stands out.
2.1.1. Stream Processing
Using Kafka, users can create multi-stage data processing pipelines. They consume raw input data from topics, manipulate it with aggregation, enrichment or other transformations, and finally push the transformed content into new topics for further consumption or additional processing.
For example, a pipeline can crawl data from a database and post it to a “stocks” topic, giving access to the most up-to-date stock values on an exchange. The content is then normalized and duplicates are removed before the fresh price information is published to another topic, giving investors easy access for market analysis.
Additionally, these sophisticated pathways allow users to construct graphs of real-time flows based on individual topic data.
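As a rough sketch of such a pipeline, the Kafka Streams DSL can consume raw quotes, drop malformed records, normalize the payload and publish the result to a downstream topic. The topic names, broker address and the trivial string normalization below are illustrative assumptions, not a real exchange feed:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StockPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stock-normalizer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume raw quotes, keep only well-formed records, normalize the payload,
        // and publish the result to a downstream topic for further consumption.
        KStream<String, String> rawQuotes = builder.stream("stocks");
        rawQuotes
            .filter((symbol, quote) -> quote != null && !quote.isEmpty())
            .mapValues(quote -> quote.trim().toUpperCase())
            .to("stocks-normalized");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```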
2.1.2. Messaging
Apache Kafka is a great choice to replace a traditional message broker thanks to its superior throughput, built-in partitioning, replication and fault tolerance. It also offers low end-to-end latency for messaging use cases, which usually require strong durability guarantees. In practice, Kafka comfortably meets the requirements of applications that involve large-scale message processing.
2.1.3. Log Aggregation
Apache Kafka brokers simplify the collection of physical log files from servers, putting them in one centralized location for further processing. Instead of dealing only with files, this messaging system offers a cleaner abstraction of event data as an easily consumable stream. This reduces latency when handling multiple sources and distributed consumption while providing better processing capabilities.
2.1.4. Commit Log
Kafka serves as an external commit log for a distributed system, with the added benefit of log compaction. The log allows data to be replicated between nodes and helps re-sync failed nodes so they can restore their information, while compaction keeps the amount of retained data under control. Businesses can maximize efficiency and productivity when utilising this powerful tool.
3. Apache Kafka System Event-Driven Architecture
An event represents something that happened; people may also refer to events as records or messages. Kafka brokers send and receive data as events, called messages. Messages are organized into topics, which producer applications publish to and consumer applications subscribe to, which is why Kafka is called a “publish/subscribe” system.
3.1. Kafka Cluster
A Kafka cluster is a distributed computing system in which multiple devices collaborate to achieve an objective.
Several Kafka brokers can form a powerful Kafka cluster, whose primary aim is to share workloads equally across replicas and partitions. These clusters can be scaled without disruption, and they ensure that messages are safely stored and replicated, so that if one broker fails, another takes its place without data loss or degraded latency.
Here we have a single Apache Kafka broker and its typical interactions inside the system.
The producer sends streams of data to the Kafka topics. Messages are available in real-time and consumed by a subscribed consumer. When a subscription to a topic is active, the published messages are received immediately.
With Kafka Streams, stream processors can use state stores to store and query data. This feature allows developers to leverage powerful stateful operations for data transformation between different formats.
Apache Kafka Connect provides a wide range of data integration options due to its various plugins, all of which have unique configurations for access to distinct sources.
4. Concept of Kafka Topic, Partition and Offset
Kafka topics store messages on disk. Messages are immutable and are retained according to a configurable retention period per topic. Each topic can have multiple partitions, and within a partition messages are numbered in the order they were published.
4.1. Topic Partition
Partitioning is critical in most distributed systems, and Kafka is no exception. With Kafka, topics are split into a number of partitions; each partition works like an independent log file that stores records exclusively in append-only style. So all the data related to a specific topic is divided among its respective partitions for optimal storage efficiency.
By distributing the partitions of one topic across many brokers, the Kafka cluster provides scalability and robust fault tolerance. This is because different partitions can be served by different brokers in parallel. At the same time, replicas are scattered throughout the Kafka cluster to ensure that if a broker fails, another takes up where it left off without any disruption.
Users can decide the destination partitions for a message using a message key or let Kafka handle that automatically.
The offset represents a message’s position within a partition; ordering is only guaranteed within a partition, not across partitions. Using multiple partitions is the way to achieve parallelism.
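For illustration, a topic with several partitions and replicas can be created programmatically with the Java AdminClient; the topic name, partition count and replication factor below are arbitrary choices for the sketch:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions allow up to 6 consumers in one group to read in parallel;
            // replication factor 3 keeps a copy of each partition on 3 brokers.
            NewTopic stocks = new NewTopic("stocks", 6, (short) 3);
            admin.createTopics(List.of(stocks)).all().get();
        }
    }
}
```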
4.2. Streams Processor API
In Apache Kafka architecture, the stream processor acts as a consumer that processes a stream of records from a topic one at a time, applying data transformation operations that can be stateless or stateful. Stateful operations use state stores, which remember previously received input so it can be queried and updated as needed.
The data transformation happens within Kafka, and the result is written to a different topic. The Kafka Streams Processor API allows data transformations in very few lines of code, while using application threads to process messages concurrently.
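Below is a sketch of what a stateful processor might look like with the Processor API (Kafka 2.7+ style): it keeps a per-key counter in a state store and forwards each record enriched with the running count. The store, topic and application names are made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class CountingProcessorApp {

    // Stateful processor: remembers how many records it has seen per key.
    static class CountingProcessor implements Processor<String, String, String, String> {
        private KeyValueStore<String, Long> counts;
        private ProcessorContext<String, String> context;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.counts = context.getStateStore("counts-store");
        }

        @Override
        public void process(Record<String, String> record) {
            Long seen = counts.get(record.key());
            long updated = (seen == null ? 0L : seen) + 1;
            counts.put(record.key(), updated);
            // Forward the original value enriched with the running count.
            context.forward(record.withValue(record.value() + " (seen " + updated + " times)"));
        }
    }

    public static void main(String[] args) {
        StoreBuilder<KeyValueStore<String, Long>> storeBuilder =
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("counts-store"),
                Serdes.String(), Serdes.Long());
        ProcessorSupplier<String, String, String, String> counting = CountingProcessor::new;

        Topology topology = new Topology();
        topology.addSource("source", new StringDeserializer(), new StringDeserializer(), "input-topic");
        topology.addProcessor("counter", counting, "source");
        topology.addStateStore(storeBuilder, "counter");
        topology.addSink("sink", "output-topic", new StringSerializer(), new StringSerializer(), "counter");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "counting-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```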
For a deep dive into Streams Processor API, please visit here.
5. Kafka Consumer Application
The Consumer Application will always read data in order from a specific partition.
Within a consumer group, each partition is assigned to at most one consumer. The current position of a consumer is called the consumer offset. The consumer offset can be changed if we want to reprocess particular messages in the topic. One consumer can read from multiple partitions.
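A minimal consumer sketch illustrating these ideas: it joins a consumer group, subscribes to a topic and polls records, which arrive in offset order within each partition. The topic, group id and broker address are placeholder values:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "stocks-readers"); // the consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("stocks"));
            while (true) {
                // Records from the same partition arrive in offset order.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```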
5.1. Idempotent Consumer
Idempotence means that applying an operation multiple times does not change the result beyond the initial application. For a Kafka consumer application, we want to handle duplicate messages safely in cases where processing them twice would cause problems. One approach is to filter out messages that have already been processed, using a caching solution or a database, depending on the circumstances. You can take a look at the idempotent consumer in more detail here.
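As a simple sketch of that idea (using an in-memory set where a real application might use a cache or database), duplicates can be filtered out by remembering the identifiers of messages that have already been processed; the message-id parameter here is a hypothetical deduplication key:

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    // In production this set would typically live in a cache or database
    // so the record of processed IDs survives restarts.
    private final Set<String> processedIds = new HashSet<>();

    public void handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return; // duplicate: already processed, skip it
        }
        process(payload);
    }

    private void process(String payload) {
        // business logic that must not run twice for the same message
        System.out.println("Processing: " + payload);
    }
}
```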
5.2. Delivery Semantics
As mentioned before, the consumer offset is the current position of a consumer within a partition. We can commit the offset manually, or the Kafka cluster can commit it automatically. When the offset is committed relative to the processing of a message determines the delivery semantics, of which there are three types:
5.2.1. At Most Once
The offset is committed as soon as the message is received, so the message won’t be reprocessed if something goes wrong during processing.
5.2.2. At Least Once
An offset commit happens after the message has been processed. The consumer will reprocess the message if something goes wrong in the processing. This requires the creation of an idempotent consumer so that reprocessing already processed messages won’t have an impact on the system.
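One common way to get at-least-once behaviour is to disable auto-commit and commit offsets only after the polled records have been processed; the sketch below assumes placeholder topic and group names:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "at-least-once-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("stocks"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // process first ...
                    System.out.println(record.value());
                }
                // ... then commit: if processing fails before this line,
                // the same records are redelivered next time (at-least-once).
                consumer.commitSync();
            }
        }
    }
}
```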
5.2.3. Exactly Once
Each message is processed exactly once. It is harder to create consumers for this semantic: it requires an idempotent producer and atomic transactions in which either everything happens or nothing does, similar to the commit/rollback strategy in a database.
6. Kafka Producer Application
A single producer can send messages to multiple Kafka cluster topics. When a producer application publishes a message without a key, Kafka chooses the destination partition itself. To control the placement, we can specify a key that acts as the message identifier.
Messages with matching keys will always be stored on the same partition. The only exception is when we add a new partition to a topic that already contains data: the key-to-partition mapping changes, so new messages may land on the new partition even though older messages with the same key sit on one of the old partitions.
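Here is a producer sketch that sends a keyed record, so that every message with the same key (a stock symbol in this made-up example) lands on the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("AAPL") determines the partition: the default partitioner
            // hashes the key, so all messages with this key go to the same partition.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("stocks", "AAPL", "price=192.31");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.printf("stored in partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```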
7. Apache Kafka Connect
Kafka Connect is a data integration tool that transfers data between Kafka clusters and non-Kafka systems, such as databases or cloud storage, enabling distributed data consumption from multiple sources.
The platform is horizontally scalable and fault-tolerant as we can assign the desired number of workers to connect the Kafka cluster to different non-Kafka data sources using connectors.
A connector is a plugin for Kafka Connect (packaged as a Java jar file) that acts as an interface between Kafka and a non-Kafka system. Each connector has its own configuration that can be adjusted to specific needs. Here we can browse the different connectors that are available and either download them or install them using the hub client application.
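As an example, the simple file source connector that ships with Kafka can be configured with a small properties file; the file path and topic name below are placeholders. The worker then streams each new line of the file into the topic:

```properties
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/app.log
topic=connect-file-events
```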
8. Apache Kafka Zookeeper
Zookeeper is a critical component in Kafka’s distributed infrastructure; it links brokers and consumers by storing metadata, allowing processes to connect through a shared set of nodes known as data registers. This ensures everyone has access to the same information from one centralized location.
Apache Kafka 3.0 has changed how we manage our clusters, as it is phasing out Zookeeper along with its single point of failure and maintenance costs. Rather than requiring users to maintain another system, Kafka brokers assume Zookeeper’s role themselves by storing metadata locally in an internal metadata log (KRaft mode). The controller ensures registered brokers are up to date and swiftly removes those that fail. The startup process requires minimal computing resources from the broker, allowing more partitions to be supported without added CPU consumption.
9. Summary
In this article, we have looked at Kafka architecture in a small distributed system. Instead of building custom bridges between different systems to share data, Kafka Connect and the Streams API let producers and consumers reach far beyond their own boundaries. The former establishes the connections to external systems, while the latter transforms messages on the fly, so each consumer receives precisely what it needs.