Why Use Kafka?

Apache Kafka is a distributed data streaming platform designed for high-throughput, real-time data processing. It is used to handle continuous streams of data from many different sources at the same time.

Key features of Kafka:

  1. Scalability: Kafka distributes data across multiple servers, allowing it to handle massive amounts of information beyond a single machine's capacity.

  2. Speed: With its decoupled write and read data streams, Kafka offers extremely low latency, making it ideal for real-time applications.

  3. Durability: Data is replicated across servers and written to disk, protecting against server failures.

  4. Flexible data retention: Kafka can store data for configurable periods, from seconds to years.

  5. Multiple consumer support: Unlike traditional message queues, Kafka allows multiple applications to read the same data stream independently.

Kafka's architecture combines messaging queue and publish-subscribe models. It organizes data into topics that are partitioned. The messages within a partition are processed in order.

Kafka is used to build real time data pipelines. For example, Uber drivers send their location to the server every few seconds. This location is added to a Kafka queue. From the queue, the location is read by different consumers for applications like matching drivers with riders or showing nearby drivers to the riders on the app.

How to Scale Kafka in your System Design Interview?

Your e-commerce system processes 1,000 orders/second. Black Friday hits - suddenly it's 50,000 orders/second. How do you scale Kafka?

๐Ÿงฉ ๐—”๐—ฑ๐—ฑ ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐˜€ ๐—ณ๐—ผ๐—ฟ ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—น๐—น๐—ฒ๐—น ๐—ฝ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€๐—ถ๐—ป๐—ด : Partitions are independent, ordered logs. Each partition guarantees order within itself. Customer A's orders in partition 7 stay in sequence, while Customer B's orders in partition 23 process in parallel. More partitions = more throughput.

๐Ÿ–ฅ๏ธ ๐—ฆ๐—ฐ๐—ฎ๐—น๐—ฒ ๐—ฏ๐—ฟ๐—ผ๐—ธ๐—ฒ๐—ฟ๐˜€ ๐—ต๐—ผ๐—ฟ๐—ถ๐˜‡๐—ผ๐—ป๐˜๐—ฎ๐—น๐—น๐˜† : Brokers are Kafka servers that host partitions. Running 50 partitions on 3 brokers? Add more brokers to spread the load. Kafka automatically rebalances partitions across all brokers.

๐Ÿ‘ฅ ๐—ฆ๐—ฐ๐—ฎ๐—น๐—ฒ ๐—ฐ๐—ผ๐—ป๐˜€๐˜‚๐—บ๐—ฒ๐—ฟ ๐—ด๐—ฟ๐—ผ๐˜‚๐—ฝ๐˜€ : You can have up to one consumer per partition. With 50 partitions, you can run up to 50 parallel consumers. Each sees events in order within their partition. Need more parallelism? Add partitions.

๐—ง๐—ต๐—ฒ ๐—ธ๐—ฒ๐˜† ๐˜๐—ฟ๐—ฎ๐—ฑ๐—ฒ๐—ผ๐—ณ๐—ณ: More partitions = more throughput BUT ordering only within each partition, not globally.