TinyURL system design interview question
Creating a streamlined and efficient URL shortening system requires a comprehensive approach, focusing on link generation, scalability, performance, and reliability. This article will provide a detailed guide for designing an effective URL shortening system with a particular focus on collision avoidance and eventual consistency, aimed at helping programmers prepare for system design interviews.
Credit: https://youtu.be/JQDHz72OA3c?si=wC_p5GxBcd7aJ7aj and https://www.youtube.com/watch?v=5V6Lam8GZo4&t=843s.
System Requirements
Functional Requirements:
Link Generation: The system should generate 7-character short URLs for long web page URLs.
Non-Functional Requirements:
Scalability: Handle a large number of users and links.
Performance: Ensure low latency in link generation and resolution.
Reliability: Ensure high availability and fault tolerance.
Capacity Model
To ensure the system can handle the required load, we need to build a capacity model:
Daily Link Generation and Read/Write Assumptions:
— Write: assume the system generates 50 million links per day (or 18.25 billion per year).
— Read: assume each short URL is read 100 times more frequently than it is written (or 1.825 trillion reads per year).
Storage Requirements per Year:
— Average entry size: assume each entry in the database (short URL, original URL, metadata) requires approximately 500 bytes.
— Total Storage: 18.25 billion entries * 500 bytes ≈ 9.125 terabytes per year.
Despite requiring only ~9 terabytes of storage per year, the system needs to be designed to handle a very high read-to-write ratio efficiently.
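The capacity figures above follow from straightforward arithmetic; a quick sanity check (the 365-day year and 500-byte entry size are the stated assumptions):

```python
writes_per_day = 50_000_000
writes_per_year = writes_per_day * 365          # 18.25 billion links/year
reads_per_year = writes_per_year * 100          # 1.825 trillion reads/year

entry_bytes = 500                               # short URL + original URL + metadata
storage_tb_per_year = writes_per_year * entry_bytes / 1e12  # ≈ 9.125 TB
```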
Link Generation Algorithm (Base62 vs Hashing)
Base62 Encoding:
- Description: Converts unique ID to 7-character base62 string.
- Collision Handling Complexity: Depends on a counter/UUID, which is challenging to scale, but it has no collision risk (at 18.25 billion URLs per year, the 62^7 ≈ 3.5 trillion combinations would take about 193 years to exhaust).
Example: https://tiny.url/dnh (from ID 12345).
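A minimal base62 encoder illustrating the scheme. The lowercase-first alphabet order here is an assumption chosen to reproduce the `ID 12345 → dnh` example; any fixed 62-character alphabet works:

```python
import string

# 62-character alphabet; lowercase-first order matches the ID 12345 -> "dnh" example.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits

def base62_encode(n: int) -> str:
    """Convert a non-negative integer ID to its base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))
```

In production the result would be left-padded to a fixed 7 characters; the padding is omitted here to keep the example matching the short `dnh` output.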
Hashing (Limited to First 7 Characters):
- Description: Uses MD5 or SHA-256, limited to first 7 characters.
- Collision Handling Complexity: Must handle collision risk via DB-level detection, which inevitably affects write performance.
Example: https://tiny.url/098f6bc (from https://example.com).
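A sketch of the truncated-hash approach; because the digest is cut to 7 characters, distinct URLs can collide, so the insert path must detect collisions at the database level (e.g., via a unique-key violation):

```python
import hashlib

def hash_code(original_url: str) -> str:
    """Derive a 7-character code from the first 7 hex chars of the URL's MD5.

    Truncation means distinct URLs can produce the same code, so the caller
    must check for an existing key in the DB before (or while) inserting.
    """
    return hashlib.md5(original_url.encode("utf-8")).hexdigest()[:7]
```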
High-Level Design
To avoid DB-level collision handling, we will choose base62 encoding, but it must encode a unique counter value to avoid collisions. This looks easy if we have a single link generation service:
To handle link generation at scale, we will need to multiply the link generation service instances and use a Load Balancer to distribute the load between them, but the challenge is how to handle the counter:
Let's discuss how to implement the counter (DB-level / Counter Service / ZooKeeper):
1. Pre-generate the keys at DB-Level
One option is to pre-generate the entire set of possible keys of the counter in an ACID-compliant database.
To avoid collisions, each transaction acquires a lock on an unused key (e.g., SELECT * FROM keys WHERE key = @key FOR UPDATE) and marks it as used.
Advantages:
- Centralized Control: Ensures a single source of truth for the counter, preventing duplicate IDs.
- Simplicity: Easier to implement as the database can handle increments automatically.
Disadvantages:
- Performance Bottlenecks: when multiple transactions simultaneously attempt to acquire locks on keys, contention can become significant, and relying on a single database for key management creates a bottleneck.
- Scalability Issues: Scaling an ACID-compliant database horizontally is more challenging.
To address these issues we can choose MySQL and shard the database based on the hashed key, which should be evenly distributed across the shards, and reduce contention with optimistic locking, retrying within the same shard.
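A sketch of the pre-generated-keys approach with optimistic claiming, using in-memory SQLite as a stand-in for the sharded MySQL table (the table name, schema, and sample keys are illustrative):

```python
import sqlite3

# Pre-generate the key space up front (three sample keys for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keys (key TEXT PRIMARY KEY, used INTEGER NOT NULL DEFAULT 0)")
conn.executemany("INSERT INTO keys (key) VALUES (?)",
                 [("aaaaaaa",), ("aaaaaab",), ("aaaaaac",)])
conn.commit()

def claim_key(conn):
    """Optimistically claim an unused pre-generated key, retrying on races."""
    while True:
        row = conn.execute("SELECT key FROM keys WHERE used = 0 LIMIT 1").fetchone()
        if row is None:
            return None                      # key space exhausted
        candidate = row[0]
        # The conditional UPDATE is the optimistic lock: it succeeds only if
        # no concurrent transaction claimed the key between SELECT and UPDATE.
        cur = conn.execute(
            "UPDATE keys SET used = 1 WHERE key = ? AND used = 0", (candidate,))
        if cur.rowcount == 1:
            conn.commit()
            return candidate
        # rowcount == 0: lost the race; loop and try the next free key.
```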
2. Counter Service:
A dedicated service to manage and distribute the counter across multiple instances of the URL generation service.
Advantages:
- Scalability: Allows for horizontal scaling of URL generation services.
- Fault Tolerance: Multiple instances of the counter service can be deployed to ensure high availability.
Disadvantages:
- Additional Complexity: Requires the design and maintenance of an additional service.
- Synchronization Overhead: Ensuring consistency between counter service instances can introduce some overhead.
While this option looks good in terms of solving the collision issue at scale, it introduces synchronization overhead and even requires its own load balancer.
3. Apache ZooKeeper
Using Apache ZooKeeper to manage counters in a distributed manner.
Advantages:
- Reliability: ZooKeeper is designed for high availability and fault tolerance.
- Consistency: Provides strong consistency guarantees for distributed applications.
Disadvantages:
- Complexity: Requires understanding and managing ZooKeeper’s distributed coordination framework.
- Latency: Coordination overhead can introduce latency in high-throughput environments.
To use ZooKeeper for managing a counter in a URL shortening system, we will create a Process Flow:
- Each link generation service initially registers with ZooKeeper to request an initial counter range (for example 1–100,000, 100,001–200,000 etc).
- ZooKeeper assigns unique counter ranges to each service, which are stored in their respective memory.
- As a service exhausts its current counter range (or goes down, in which case the unused remainder of its range can simply be discarded), it requests a new range from ZooKeeper.
- ZooKeeper ensures that each service receives non-overlapping and unique counter ranges, facilitating distributed coordination.
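The process flow above can be sketched as follows. `RangeCoordinator` here is an in-process stand-in for ZooKeeper (in production, a ZooKeeper client such as kazoo would provide the atomic range assignment), and the tiny block size is only for illustration:

```python
import threading

class RangeCoordinator:
    """Stand-in for the ZooKeeper-backed coordinator: hands out
    non-overlapping counter ranges of a fixed block size."""
    def __init__(self, block_size=100_000):
        self._next = 1
        self._block = block_size
        self._lock = threading.Lock()  # ZooKeeper provides this atomicity in production

    def next_range(self):
        with self._lock:
            start = self._next
            self._next += self._block
            return start, start + self._block - 1

class LinkGenerator:
    """A link-generation service instance consuming its assigned range in memory."""
    def __init__(self, coordinator):
        self._coord = coordinator
        self._ids = iter(())  # empty: forces a range request on first use

    def next_id(self):
        try:
            return next(self._ids)
        except StopIteration:
            lo, hi = self._coord.next_range()  # range exhausted: request a new one
            self._ids = iter(range(lo, hi + 1))
            return next(self._ids)
```

Because every service consumes IDs only from its own range, no two services can ever produce the same base62 code, and a crashed service simply wastes the remainder of one block.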
Database Choice Based on Requirements
When designing a URL shortening system, the focus is on handling high-scale reads and writes with a relatively small amount of data.
Scalability and Performance:
- NoSQL: Excels in horizontal scaling (as opposed to SQL), making it suitable for handling high volumes of link generations and accesses. Optimized for high throughput and low latency.
- NewSQL: Combines the scalability benefits of NoSQL with strong ACID transactions (less important here, since counter coordination is handled by ZooKeeper), suitable for high-throughput applications.
Reliability and Availability:
- NoSQL/NewSQL when integrated with ZooKeeper for managing counters, can ensure robust reliability and availability. Built-in replication and partitioning strategies mitigate the risk of downtime, crucial for maintaining continuous service operation.
Given the system’s emphasis on handling high-scale reads and writes efficiently with a relatively small amount of data (9.125 terabytes per year — too large for an in-memory DB), NoSQL databases are favored over SQL and NewSQL:
- NoSQL excels in horizontal scaling, which is critical for accommodating high volumes of link generations and accesses.
- The system avoids the need for strong ACID transactions, since counter coordination is handled externally by ZooKeeper.
- NoSQL offers flexibility without being tied to specific cloud provider ecosystems, unlike NewSQL solutions.
Other Concerns
DB Cleanups
After a while, we can assume some links are no longer needed. Therefore, we need a process to purge them and possibly reclaim their keys.
DB Partitioning and Replicas
In a heavily read environment, the DB must be partitioned with replicas to ensure high availability and low latency.
Cache and Load Balancer
Implement caching mechanisms to reduce load on the DB and use load balancers to distribute traffic evenly across servers.
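A minimal read-path cache sketch using an in-process LRU; the dict stands in for the database, and a production system would more likely put a distributed cache (e.g., Redis or Memcached) in front of the DB:

```python
from functools import lru_cache
from typing import Optional

# Stand-in for the URL store (short code -> original URL).
URL_STORE = {"dnh": "https://example.com/some/long/path"}

@lru_cache(maxsize=100_000)
def resolve(short_code: str) -> Optional[str]:
    """Read path: repeat lookups for hot links are served from the cache
    instead of hitting the database."""
    return URL_STORE.get(short_code)
```

Given the 100:1 read-to-write ratio from the capacity model, even a modest cache hit rate removes most of the database read load.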
Security and Permissions
Ensure secure communication channels (e.g., HTTPS), and implement authentication and authorization mechanisms to restrict access to the URL shortening service.
Summary
Designing a URL shortening system requires addressing scalability, performance, and reliability challenges. NoSQL databases, integrated with Apache ZooKeeper for distributed counter management, excel in meeting these requirements without the overhead of strong ACID transactions. This approach ensures efficient handling of high-scale operations while maintaining system flexibility and availability.