How to Handle Network Failures?

When network issues prevent servers from communicating with each other, your system needs to handle this gracefully. This is where partition tolerance comes in.

A network partition happens when some servers in your distributed system can't talk to others. Imagine you have servers in two data centers, one in the US and one in Europe. If the connection between these data centers fails, you have a network partition. The servers in each region still work, but they can't communicate across regions.

There are two main approaches to handle network partitions: CP (Consistent and Partition Tolerant) and AP (Available and Partition Tolerant).

CP systems prioritize consistency. When a partition occurs, they block operations that could make data inconsistent. For example, if you're running an online banking system, you might prefer to show an error message rather than risk showing incorrect account balances.

AP systems prioritize availability. They continue operating during partitions, even if it means data might temporarily be inconsistent. Social media platforms often choose this approach. If some users see a slightly out-of-date feed for a few minutes, that's better than the entire system being unavailable.
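The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database works: the Replica class, its mode and partitioned fields, and the pending_sync list are all hypothetical names invented for this example.

```python
class Replica:
    """Toy replica that handles writes in either CP or AP mode (hypothetical)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.data = {}
        self.partitioned = False  # True while the other replicas are unreachable
        self.pending_sync = []    # writes to send to other replicas after the partition heals

    def write(self, key, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than risk diverging from other replicas.
            raise RuntimeError("unavailable: cannot reach other replicas")
        # AP (or no partition): accept the write locally.
        self.data[key] = value
        if self.partitioned:
            # AP during a partition: remember the write so it can be reconciled later.
            self.pending_sync.append((key, value))
        return "ok"
```

A CP replica raises an error on write while partitioned is True, which corresponds to the banking system showing an error message. An AP replica accepts the write and queues it, so the two sides of the partition may disagree until they reconcile.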

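To detect network partitions, systems use heartbeat mechanisms. Each server regularly sends small messages saying "I'm alive" to other servers. If these messages stop arriving for longer than a timeout, the system suspects that the sender has either crashed or ended up on the other side of a partition. Missing heartbeats alone can't tell the two cases apart.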
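Heartbeat tracking can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the HeartbeatMonitor class, its method names, and the timeout value are hypothetical, and a real system would receive heartbeats over the network rather than through direct method calls.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer and flags silent ones."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock    # injectable so the monitor can be tested without sleeping
        self.last_seen = {}   # peer name -> time of most recent heartbeat

    def record_heartbeat(self, peer):
        """Call this whenever an "I'm alive" message arrives from a peer."""
        self.last_seen[peer] = self.clock()

    def suspected_peers(self):
        """Return peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The timeout is a trade-off: a short value detects problems quickly but may flag a peer that is merely slow, while a long value delays detection.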

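Modern systems often use a quorum-based approach. Instead of requiring all servers to agree, decisions can be made if a majority (quorum) of servers agree, meaning more than half of them. This helps systems stay operational even when some servers are unreachable. Because two sides of a partition can't both contain more than half of the servers, at most one side can keep making decisions.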
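The majority rule comes down to simple arithmetic. The sketch below assumes a fixed, known cluster size; the function names quorum_size and has_quorum are made up for this example.

```python
def quorum_size(cluster_size):
    """Smallest number of servers that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable_servers, cluster_size):
    """True if this side of a partition can still make decisions.

    reachable_servers counts the servers on this side, including the local one.
    """
    return reachable_servers >= quorum_size(cluster_size)
```

For a 5-server cluster split 3 and 2, has_quorum(3, 5) is True and has_quorum(2, 5) is False, so only the larger side keeps accepting decisions. With an even split, such as 2 and 2 in a 4-server cluster, neither side has a quorum, which is one reason clusters are often sized with an odd number of servers.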

Here's a comparison of CP vs AP systems:

CP	AP
Prioritize consistency	Prioritize availability
May become unavailable	Stays available
Good for financial systems	Good for social media
Block operations during partitions	Continue operations during partitions
Example: MongoDB (typical configuration)	Example: Cassandra (typical configuration)