Fundamentals

1. Scalability

As a system grows, its performance starts to degrade unless it is adapted to deal with that growth.

Scalability is the property of a system to handle a growing amount of load by adding resources to the system.

A system that can continuously evolve to support a growing amount of work is scalable.

2. Availability

Availability refers to the proportion of time a system is operational and accessible when required.

Availability = Uptime / (Uptime + Downtime)

Uptime: The period during which a system is functional and accessible.

Downtime: The period during which a system is unavailable due to failures, maintenance, or other issues.
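
As a quick sanity check, the formula above can be applied to some hypothetical uptime figures (the numbers below are made up for illustration):

```python
# Hypothetical figures: about 4 hours of downtime over a 365-day year.
uptime_hours = 8756.0
downtime_hours = 4.0

availability = uptime_hours / (uptime_hours + downtime_hours)
print(f"Availability: {availability:.4%}")  # roughly 99.95%, i.e. "three and a half nines"
```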

3. Latency vs Throughput

Latency

Latency refers to the time it takes for a single operation or request to complete.

  • Low latency means faster response times and a more responsive system.

  • High latency can lead to lag and poor user experience.

Throughput

Throughput measures the amount of work done or data processed in a given period of time.

  • It is typically expressed in terms of requests per second (RPS) or transactions per second (TPS).
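
The two metrics are related but distinct. Under a simplified steady-state model (a form of Little's Law), sustained throughput equals the number of requests in flight divided by per-request latency; the numbers below are hypothetical:

```python
latency_s = 0.1    # 100 ms per request (hypothetical)
concurrency = 10   # requests processed in parallel (hypothetical)

# Little's Law, rearranged: throughput = concurrency / latency
throughput_rps = concurrency / latency_s
print(throughput_rps)  # 100.0 requests per second
```

This is why lowering latency or raising concurrency both raise throughput, up to the point where some resource saturates.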

4. CAP Theorem

CAP stands for Consistency, Availability, and Partition Tolerance, and the theorem states that:

It is impossible for a distributed data store to simultaneously provide all three guarantees.

  • Consistency (C): Every read receives the most recent write or an error.

  • Availability (A): Every request (read or write) receives a non-error response, without the guarantee that it contains the most recent write.

  • Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

5. Load Balancers

Load Balancers distribute incoming network traffic across multiple servers to ensure that no single server is overwhelmed.

Popular Load Balancing Algorithms:

  1. Round Robin: Distributes requests evenly in circular order.
  2. Weighted Round Robin: Distributes requests based on server capacity weights.
  3. Least Connections: Sends requests to the server with the fewest active connections.
  4. Least Response Time: Routes requests to the server with the fastest response time.
  5. IP Hash: Assigns requests based on hashed client IP address.
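
The first algorithm can be sketched in a few lines. The `RoundRobinBalancer` class and the server names below are illustrative, not a real library API:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hands out servers in circular order, one per request."""
    def __init__(self, servers):
        self._servers = cycle(servers)

    def next_server(self):
        return next(self._servers)

lb = RoundRobinBalancer(["server-a", "server-b", "server-c"])
assignments = [lb.next_server() for _ in range(5)]
print(assignments)  # ['server-a', 'server-b', 'server-c', 'server-a', 'server-b']
```

Weighted round robin follows the same idea but repeats each server in proportion to its capacity weight.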

6. Databases

A database is an organized collection of structured or unstructured data that can be easily accessed, managed, and updated.

Types of Databases

  1. Relational Databases (RDBMS)
  2. NoSQL Databases
  3. In-Memory Databases
  4. Graph Databases
  5. Time Series Databases
  6. Spatial Databases

7. Content Delivery Network (CDN)

A CDN is a geographically distributed network of servers that work together to deliver web content (like HTML pages, JavaScript files, stylesheets, images, and videos) to users based on their geographic location.

The primary purpose of a CDN is to deliver content to end-users with high availability and performance by reducing the physical distance between the server and the user.

When a user requests content from a website, the CDN redirects the request to the nearest server in its network, reducing latency and improving load times.

8. Message Queues

A message queue is a communication mechanism that enables different parts of a system to send and receive messages asynchronously.

Producers can send messages to the queue and move on to other tasks without waiting for consumers to process the messages.

Multiple consumers can pull messages from the queue, allowing work to be distributed and balanced across different consumers.
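
A minimal in-process sketch of this producer/consumer pattern, using Python's standard `queue` and `threading` modules (a real deployment would use a broker such as RabbitMQ or Kafka; the task names are made up):

```python
import queue
import threading

q = queue.Queue()
results = []

def producer():
    # The producer enqueues work and moves on without waiting for consumers.
    for i in range(5):
        q.put(f"task-{i}")

def consumer(name):
    while True:
        item = q.get()
        if item is None:          # sentinel: shut down this worker
            q.task_done()
            break
        results.append((name, item))
        q.task_done()

workers = [threading.Thread(target=consumer, args=(f"worker-{n}",)) for n in range(2)]
for w in workers:
    w.start()

producer()
for _ in workers:                 # one shutdown sentinel per worker
    q.put(None)
q.join()                          # block until every message is processed
for w in workers:
    w.join()

print(len(results))  # 5: all tasks processed, distributed across two workers
```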

9. Rate Limiting

Rate limiting helps protect services from being overwhelmed by too many requests from a single user or client.

Rate Limiting Algorithms:

  1. Token Bucket: Allows bursts of traffic within an overall rate limit.
  2. Leaky Bucket: Smooths traffic flow at constant rate.
  3. Fixed Window Counter: Limits requests in fixed time intervals.
  4. Sliding Window Log: Tracks requests within rolling time window.
  5. Sliding Window Counter: Smooths rate between adjacent fixed windows.
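
The first algorithm, the token bucket, is small enough to sketch directly; the class and the rate/capacity values below are illustrative:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity` while enforcing an average `rate` per second."""
    def __init__(self, rate, capacity):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
burst = [bucket.allow() for _ in range(12)]
# The first 10 requests (the burst) pass; the rest are rejected until tokens refill.
```

A leaky bucket is the mirror image: instead of accumulating tokens, it drains queued requests at a constant rate, smoothing traffic rather than permitting bursts.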

10. Database Indexes

A database index is a super-efficient lookup table that allows a database to find data much faster.

It holds the indexed column values along with pointers to the corresponding rows in the table.

Without an index, the database might have to scan every single row in a massive table to find what you want - a painfully slow process.

But, with an index, the database can zero in on the exact location of the desired data using the index's pointers.
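
A toy illustration of the idea: treat the index as a mapping from column values to row positions. Real databases typically use B-trees rather than hash maps, and the column and values below are invented, but the lookup benefit is the same:

```python
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]

# Build an index on the "email" column: value -> list of row positions.
email_index = {}
for pos, row in enumerate(rows):
    email_index.setdefault(row["email"], []).append(pos)

# The lookup jumps straight to the matching rows instead of scanning the table.
matches = [rows[pos] for pos in email_index.get("a@example.com", [])]
print([r["id"] for r in matches])  # [1, 3]
```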

11. Caching

Caching is a technique used to temporarily store copies of data in high-speed storage layers to reduce the time taken to access data.

The primary goal of caching is to improve system performance by reducing latency, offloading the main data store, and providing faster data retrieval.

Caching Strategies:

  1. Read-Through Cache: Automatically fetches and caches missing data from source.
  2. Write-Through Cache: Writes data to cache and source simultaneously.
  3. Write-Back Cache: Writes to cache first, updates source later.
  4. Cache-Aside: Application manages data retrieval and cache population.

Caching Eviction Policies:

  1. Least Recently Used (LRU): Removes the item that hasn't been accessed for the longest time.
  2. Least Frequently Used (LFU): Discards items with the lowest access frequency over time.
  3. First In First Out (FIFO): Removes the oldest item, regardless of its usage frequency.
  4. Time-to-Live (TTL): Automatically removes items after a specified time has passed.
  5. Last In First Out (LIFO): Evicts the most recently added item.
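
The LRU policy is commonly implemented with a structure that tracks access order. A minimal sketch (not a production cache) using Python's `OrderedDict`:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # touching "a" makes "b" the least recently used entry
cache.put("c", 3)    # over capacity: "b" is evicted
print(cache.get("b"))  # None
```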

12. Consistent Hashing

Consistent Hashing is a special kind of hashing technique that allows for efficient distribution of data across a cluster of nodes.

Consistent hashing ensures that only a small portion of the data needs to be reassigned when nodes are added or removed.

How Does it Work?

  1. Hash Space: Imagine a fixed circular space or "ring" ranging from 0 to 2^n - 1.
  2. Mapping Servers: Each server is mapped to one or more points on this ring using a hash function.
  3. Mapping Data: Each data item is also hashed onto the ring.
  4. Data Assignment: A data item is stored on the first server encountered while moving clockwise on the ring from the item's position.
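
The four steps above can be sketched with a sorted list and binary search. The `ConsistentHashRing` class, the use of MD5, and the node names are illustrative choices; virtual nodes (several ring points per server) are included because they smooth out the distribution:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to nodes on a hash ring; lookups walk clockwise to the next node."""
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per server, to even out the load
        self.ring = []         # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}:{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get(self, key):
        # Step 4: the key belongs to the first node found moving clockwise.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.get("user-42")   # one of the three nodes, stable across calls
```

Removing a node only remaps the keys that were assigned to it; every other key keeps its owner, which is exactly the property described above.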

13. Database Sharding

Database sharding is a horizontal scaling technique used to split a large database into smaller, independent pieces called shards.

These shards are then distributed across multiple servers or nodes, each responsible for handling a specific subset of the data.

By distributing the data across multiple nodes, sharding can significantly reduce the load on any single server, resulting in faster query execution and improved overall system performance.
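
A common way to pick a shard is to hash a shard key. The function below is a hypothetical sketch; note that plain modulo hashing remaps many keys when the shard count changes, which is exactly the problem consistent hashing (above) addresses:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(user_id: str) -> int:
    """Hash-based sharding: the shard key (here, user_id) determines the shard."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard.
shard = shard_for("user-12345")
```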

14. Consensus Algorithms

In a distributed system, nodes need to work together to maintain a consistent state.

However, due to the inherent challenges like network latency, node failures, and asynchrony, achieving this consistency is not straightforward.

Consensus algorithms address these challenges by ensuring that all participating nodes agree on the same state or sequence of events, even when some nodes might fail or act maliciously.

Popular Consensus Algorithms

  1. Paxos: Paxos works by electing a leader that proposes a value, which is then accepted by a majority of the nodes.
  2. Raft: Raft works by designating one node as the leader to manage log replication and ensure consistency across the cluster.

15. Proxy Servers

A proxy server acts as a gateway between you and the internet. It's an intermediary server separating end users from the websites they browse.

Two common types of proxy servers:

  1. Forward Proxies: Sits in front of a client and forwards requests to the internet on behalf of the client.
  2. Reverse Proxies: Sits in front of a web server and forwards requests from clients to the server.

16. Heartbeats

In distributed systems, a heartbeat is a periodic message sent from one component to another to monitor each other's health and status.

Without a heartbeat mechanism, it's hard to quickly detect failures in a distributed system, leading to:

  • Delayed fault detection and recovery
  • Increased downtime and errors
  • Decreased overall system reliability
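
A minimal failure detector built on this idea; the class, node names, and timeout below are hypothetical:

```python
import time

class HeartbeatMonitor:
    """Declares a node dead if no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, node):
        self.last_seen[node] = time.monotonic()

    def dead_nodes(self):
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout=0.1)
monitor.beat("node-a")
monitor.beat("node-b")
time.sleep(0.2)           # node-a goes silent...
monitor.beat("node-b")    # ...while node-b keeps sending heartbeats
print(monitor.dead_nodes())  # ['node-a']
```

In practice the timeout must be tuned: too short and slow-but-healthy nodes are declared dead; too long and real failures go unnoticed.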

17. Checksums

A checksum is a unique fingerprint attached to the data before it's transmitted.

When the data arrives at the recipient's end, the fingerprint is recalculated to ensure it matches the original one.

If the checksum of a piece of data matches the expected value, you can be confident that the data hasn't been modified or damaged.
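
A sketch using a SHA-256 digest as the fingerprint; real transports often use faster checksums such as CRC32, but the principle is the same (the payloads below are invented):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

payload = b"important payload"
fingerprint = checksum(payload)     # attached to the data before transmission

# On arrival, the recipient recomputes the fingerprint and compares.
received_ok = b"important payload"
received_bad = b"imp0rtant payload"
print(checksum(received_ok) == fingerprint)   # True:  data intact
print(checksum(received_bad) == fingerprint)  # False: corruption detected
```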

18. Service Discovery

Service discovery is a mechanism that allows services in a distributed system to find and communicate with each other dynamically.

It hides the complex details of where services are located, so they can interact without knowing each other's exact network locations.

Service discovery registers and maintains a record of all your services in a service registry.

This service registry acts as a single source of truth that allows your services to query and communicate with each other.
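
An in-memory sketch of such a registry; production systems use tools like Consul, etcd, or ZooKeeper, and the service names and addresses below are made up:

```python
class ServiceRegistry:
    """In-memory registry: services register instances; clients look them up."""
    def __init__(self):
        self.services = {}   # service name -> set of instance addresses

    def register(self, name, address):
        self.services.setdefault(name, set()).add(address)

    def deregister(self, name, address):
        self.services.get(name, set()).discard(address)

    def lookup(self, name):
        return sorted(self.services.get(name, set()))

registry = ServiceRegistry()
registry.register("orders", "10.0.0.5:8080")
registry.register("orders", "10.0.0.6:8080")
print(registry.lookup("orders"))  # ['10.0.0.5:8080', '10.0.0.6:8080']
```

Real registries also expire entries whose heartbeats stop, tying this concept back to the heartbeat mechanism above.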

19. Bloom Filters

A Bloom filter is a probabilistic data structure that is primarily used to determine whether an element is definitely not in a set or possibly in the set.

How Does It Work?

  1. Setup: Start with a bit array of m bits, all set to 0, and k different hash functions.
  2. Adding an element: To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
  3. Querying: To query for an element, feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is definitely not in the set. If all are 1, then either the element is in the set, or we have a false positive.
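
The three steps above in a minimal sketch; the bit-array size `m`, hash count `k`, and the use of salted SHA-256 to simulate k hash functions are illustrative choices:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m   # step 1: bit array of m bits, all set to 0

    def _positions(self, item):
        # Simulate k hash functions by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):   # step 2: set the k bits to 1
            self.bits[pos] = 1

    def might_contain(self, item):
        # Step 3: any 0 bit means "definitely not"; all 1s means "possibly".
        return all(self.bits[pos] == 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True (an added item is never reported absent)
```

Choosing m and k relative to the expected number of elements controls the false-positive rate; the values here are arbitrary.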

20. Gossip Protocol

Gossip Protocol is a decentralized communication protocol used in distributed systems to spread information across all nodes.

It is inspired by the way humans share news by word-of-mouth, where each person who learns the information shares it with others, leading to widespread dissemination.

How does it work?

  1. Initialization: A node in the system starts with a piece of information, known as a "gossip."
  2. Gossip Exchange: At regular intervals, each node randomly selects another node and shares its current gossip. The receiving node then merges the received gossip with its own.
  3. Propagation: The process repeats, with each node spreading the gossip to others.
  4. Convergence: Eventually, every node in the network will have received the gossip, ensuring that all nodes have consistent information.
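
The four steps above can be sketched as a tiny simulation in which each informed node pushes the gossip to one random peer per round; real implementations exchange state in both directions and batch updates, and the ten numbered nodes here are hypothetical:

```python
import random

def gossip_round(nodes, informed):
    """Each informed node shares the gossip with one randomly chosen peer."""
    newly_informed = set()
    for node in informed:
        peer = random.choice([n for n in nodes if n != node])
        newly_informed.add(peer)
    return informed | newly_informed

nodes = list(range(10))   # ten hypothetical nodes
informed = {0}            # initialization: node 0 starts with the gossip
rounds = 0
while len(informed) < len(nodes):   # converges quickly with high probability
    informed = gossip_round(nodes, informed)
    rounds += 1
print(rounds)  # typically a handful of rounds for 10 nodes
```

Because the informed set can roughly double each round, convergence takes on the order of log(N) rounds, which is what makes gossip scale well.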


Architecture Concepts

  • Load Balancing: Distributes incoming traffic across multiple servers to ensure no single node is overwhelmed.
  • Caching: Stores frequently accessed data in memory to reduce latency.
  • Content Delivery Network (CDN): Stores static assets across geographically distributed edge servers so users download content from the nearest location.
  • Message Queue: Decouples components by letting producers enqueue messages that consumers process asynchronously.
  • Publish-Subscribe: Enables multiple consumers to receive messages from a topic.
  • API Gateway: Acts as a single entry point for client requests, handling routing, authentication, rate limiting, and protocol translation.
  • Circuit Breaker: Monitors downstream service calls and stops attempts when failures exceed a threshold.
  • Service Discovery: Automatically tracks available service instances so components can locate and communicate with each other dynamically.
  • Sharding: Splits large datasets across multiple nodes based on a specific shard key.
  • Rate Limiting: Controls the number of requests a client can make in a given time window to protect services from overload.
  • Consistent Hashing: Distributes data across nodes in a way that minimizes reorganization when nodes join or leave.