Why Use Kafka?
Apache Kafka is a distributed data streaming platform designed for high-throughput, real-time data processing. It is used to handle continuous streams of data from many different sources at the same time.
Key features of Kafka:
-
Scalability: Kafka distributes data across multiple servers, allowing it to handle massive amounts of information beyond a single machine's capacity.
-
Speed: With its decoupled write and read data streams, Kafka offers extremely low latency, making it ideal for real-time applications.
-
Durability: Data is replicated across servers and written to disk, protecting against server failures.
-
Flexible data retention: Kafka can store data for configurable periods, from seconds to years.
-
Multiple consumer support: Unlike traditional message queues, Kafka allows multiple applications to read the same data stream independently.
Kafka's architecture combines messaging queue and publish-subscribe models. It organizes data into topics that are partitioned. The messages within a partition are processed in order.
Kafka is used to build real time data pipelines. For example, Uber drivers send their location to the server every few seconds. This location is added to a Kafka queue. From the queue, the location is read by different consumers for applications like matching drivers with riders or showing nearby drivers to the riders on the app.
How to Scale Kafka in your System Design Interview?
Your e-commerce system processes 1,000 orders/second. Black Friday hits - suddenly it's 50,000 orders/second. How do you scale Kafka?
๐งฉ ๐๐ฑ๐ฑ ๐ฝ๐ฎ๐ฟ๐๐ถ๐๐ถ๐ผ๐ป๐ ๐ณ๐ผ๐ฟ ๐ฝ๐ฎ๐ฟ๐ฎ๐น๐น๐ฒ๐น ๐ฝ๐ฟ๐ผ๐ฐ๐ฒ๐๐๐ถ๐ป๐ด : Partitions are independent, ordered logs. Each partition guarantees order within itself. Customer A's orders in partition 7 stay in sequence, while Customer B's orders in partition 23 process in parallel. More partitions = more throughput.
๐ฅ๏ธ ๐ฆ๐ฐ๐ฎ๐น๐ฒ ๐ฏ๐ฟ๐ผ๐ธ๐ฒ๐ฟ๐ ๐ต๐ผ๐ฟ๐ถ๐๐ผ๐ป๐๐ฎ๐น๐น๐ : Brokers are Kafka servers that host partitions. Running 50 partitions on 3 brokers? Add more brokers to spread the load. Kafka automatically rebalances partitions across all brokers.
๐ฅ ๐ฆ๐ฐ๐ฎ๐น๐ฒ ๐ฐ๐ผ๐ป๐๐๐บ๐ฒ๐ฟ ๐ด๐ฟ๐ผ๐๐ฝ๐ : You can have up to one consumer per partition. With 50 partitions, you can run up to 50 parallel consumers. Each sees events in order within their partition. Need more parallelism? Add partitions.
๐ง๐ต๐ฒ ๐ธ๐ฒ๐ ๐๐ฟ๐ฎ๐ฑ๐ฒ๐ผ๐ณ๐ณ: More partitions = more throughput BUT ordering only within each partition, not globally.