System Design Interview Series: Designing Dropbox
The logic behind designing Dropbox
Introduction
Welcome to the inaugural article of our new series dedicated to system design. Throughout this series, we will delve into various aspects of system design, providing insights into the underlying technical concepts, their significance, and performance benchmarks for the associated technologies. In this initial article, we focus on detailing the design of Dropbox.
What’s Dropbox, in the first place?
In simple terms, Dropbox is like Google Drive. It’s a solution that offers you the possibility to host pictures, documents, videos, and other files in the cloud that can be accessed later from any computer or device connected to the internet.
Key features
In a system design interview, the initial step is to outline the essential features of the system you’re going to create. This step is extremely important, regardless of how complex the system is, because it sets a clear direction for your conversation with the interviewer. Since you typically have limited time, usually between 1.5 and 2 hours, you won’t be able to cover every single feature or component. That’s why it’s crucial to establish from the start which aspects you’ll be emphasizing.
So, the key features we’re gonna focus on today are:
- Users can upload their desktop files to the cloud
- Users can download those files
- Users can edit or delete uploaded files
- Users can pay for more space (only 10 GB are free)
System requirements
The next step in your discussion with the interviewer involves identifying the system prerequisites. This includes factors like the daily active user count, the volume of files uploaded each day, the overall user base, and so forth. These details play a crucial role in deciding the appropriate database, establishing a scaling strategy, and providing a preliminary understanding of the system’s overall structure.
For this case, we’re gonna consider the following requirements.
- We have 100M users
- We have 1M daily active users (DAU)
- Users on average upload a file of 10MB in size
- Users on average have 10 files stored
- Each active user makes around 100 edits per day
Brute force approach
When dealing with systems like this, the initial idea might be to simply upload the file to an S3 bucket or a similar storage service, and then maintain a record of the uploaded files (such as name, URL, etc.) in a database. Then, when a user wants to retrieve the files, they can access the necessary information through the database and download the file. Essentially, this is the basic concept that comes to mind.
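To make this concrete, here is a minimal sketch of the brute-force flow, assuming an S3 bucket accessed via boto3 and a simple SQLite metadata table. The bucket name, table layout, and function names are illustrative assumptions, not Dropbox’s actual implementation.

```python
# Minimal sketch of the brute-force flow (illustrative names, not Dropbox's actual code).
import sqlite3
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-file-storage"  # hypothetical bucket name


def upload_file(user_id: str, local_path: str, file_name: str) -> None:
    """Upload the whole file to object storage, then record its metadata."""
    key = f"{user_id}/{file_name}"
    s3.upload_file(local_path, BUCKET, key)  # the entire file goes up in one shot

    with sqlite3.connect("metadata.db") as db:
        db.execute(
            "INSERT INTO files (user_id, name, s3_key, uploaded_at) VALUES (?, ?, ?, ?)",
            (user_id, file_name, key, datetime.now(timezone.utc).isoformat()),
        )


def download_file(user_id: str, file_name: str, dest_path: str) -> None:
    """Look up the file's location in the database, then download it in full."""
    with sqlite3.connect("metadata.db") as db:
        row = db.execute(
            "SELECT s3_key FROM files WHERE user_id = ? AND name = ?",
            (user_id, file_name),
        ).fetchone()
    s3.download_file(BUCKET, row[0], dest_path)  # again, the whole file at once
```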
Brute force approach — retrospective
The brute force method does function on a basic level; users can upload and download files. However, there are several issues with this initial design that need attention:
- If a user experiences a network interruption during an upload or download, they must repeat the entire action, which can be time-consuming for large files.
- In the case of a 2GB file with a minor modification (like changing one character), the entire 2GB file must be re-uploaded.
- Regarding bandwidth, uploading or downloading a 2 GB file consumes 2 GB of bandwidth each way. With 1 million daily active users (DAU), that adds up to roughly 2 million GB (2 PB) per day, which is not a scalable solution.
- Additionally, very large single requests can run into payload-size or timeout limits.
In summary, this initial approach lacks fault tolerance, as it requires re-uploading or re-downloading files in the event of a network interruption. It also poses challenges in terms of scalability and maintenance.
How to address these issues then?
That’s an excellent question. But before delving into the solution, I want to highlight the significance of initially suggesting a brute-force approach before arriving at the “ultimate” solution. This demonstrates that you possess a comprehensive understanding of the system, the ability to step back and contemplate your work, and most importantly, an adaptable mindset — an invaluable trait in the rapidly evolving field of IT.
Let’s explore the solution. In order to tackle this issue, the engineers at Dropbox introduced a method known as “chunking.” This process involves breaking down a file into smaller segments, referred to as “chunks,” using a specialized algorithm. Instead of uploading the entire file as a single entity, these smaller parts are uploaded individually in a threaded manner.
Here’s how the process unfolds:
- During the initial upload, the entire file must be transmitted as a whole (as there’s no way around this for the first upload).
- However, when it comes to updating the file, since there’s already a reference to the previously uploaded version, the file can be divided into smaller chunks. Only the portion that has been modified is then uploaded.
- For the download, we can download each chunk separately and regroup them at the end of the process.
By implementing this approach, the shift is made from requiring 2 GB of bandwidth for both the initial upload and an update (4 GB in total) to 2 GB for the initial upload and only about 20 MB for an update (assuming the file is divided into 100 chunks and a single chunk is modified), and the same logic applies to the download.
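As a rough illustration of the idea, here is a sketch of fixed-size chunking with content hashes, so that only chunks whose hash has changed need to be re-uploaded. The 4 MB chunk size and the helper names in the usage comment are assumptions, not Dropbox’s real algorithm.

```python
# Sketch of fixed-size chunking with content hashes (illustrative; chunk size is an assumption).
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB per chunk; the real chunk size may differ


def split_into_chunks(path: str):
    """Yield (index, chunk_bytes, sha256_hex) for each fixed-size chunk of the file."""
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(CHUNK_SIZE):
            yield index, chunk, hashlib.sha256(chunk).hexdigest()
            index += 1


def changed_chunks(path: str, previous_hashes: dict[int, str]):
    """Compare each chunk's hash with the previously uploaded version.

    Only chunks whose hash differs (or that are entirely new) need to be re-uploaded.
    """
    for index, chunk, digest in split_into_chunks(path):
        if previous_hashes.get(index) != digest:
            yield index, chunk, digest


# Usage sketch: on an update, upload only what changed.
# previous_hashes would come from the metadata database; the helpers below are hypothetical.
# for index, chunk, digest in changed_chunks("report.pdf", previous_hashes):
#     upload_chunk(user_id, file_id, index, chunk)
#     record_chunk_hash(file_id, index, digest)
```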
Overall, with this method, we can:
- Reduce our bandwidth
- Build a fault-tolerant solution, since we upload small chunks and any chunk that fails can be retried later (for example via a queue) after a network issue
- Scale easily
Rough calculations
Once the overall picture has been set, the next step is to perform some calculations. This helps in gauging the scale of our system and provides concrete insights into the types of components we’ll need to use. This is an important step that your interviewer will be expecting.
1. File storage
We’re dealing with 100 million users, each of whom has around 10 files. These files are, on average, about 10 megabytes in size. So, when you multiply these numbers together, we end up needing a total of 10 petabytes of file storage → 100M users * 10 files * 10 MB = 10,000,000 GB = 10 PB of file storage.
2. Database storage
Database storage pertains to saving the essential information related to uploaded files, known as metadata. This includes the file name (with a maximum length of 100 characters, equivalent to 100 bytes), a timestamp (typically 13 bytes), and a URL (approximately 100 bytes).
While additional elements may be included, in most cases, we anticipate storing a maximum of 500 bytes. Using this estimate, the calculation stands as follows:
100 million users * 10 files per user * 500 bytes = 500 gigabytes of metadata.
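For illustration, the metadata could live in a table shaped roughly like the one below (sketched here with SQLite). The columns and sizes mirror the estimate above and are assumptions, not Dropbox’s real schema.

```python
# One possible shape for the metadata table, matching the ~500-byte-per-record estimate above
# (illustrative schema, not Dropbox's real one).
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    file_id     INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL,
    name        TEXT    NOT NULL,   -- up to ~100 characters
    s3_key      TEXT    NOT NULL,   -- URL / object key, ~100 bytes
    uploaded_at TEXT    NOT NULL,   -- timestamp
    size_bytes  INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_files_user ON files (user_id);
"""

with sqlite3.connect("metadata.db") as db:
    db.executescript(SCHEMA)
```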
3. Daily Bandwidth
Let’s assume, on average, that a user makes an update equivalent to a single page, which is roughly 2 kilobytes in size. The estimated calculation looks like this:
1 million active daily users * 100 edits * 2KB = 200 gigabytes of daily input bandwidth.
In total, we have approximately 200 gigabytes of daily updates. Furthermore, if we take into account that each update is downloaded, we also have 200 gigabytes of daily output bandwidth.
4. Database IOPS and throughput
In total, we have two types of daily operations hitting our database: writes and reads (when updating and when downloading). In the worst case, each operation touches around 500 bytes of metadata, and each of our 1M DAU makes 100 edits per day. With these figures, we can estimate the load.
Write IOPS = 100 * 1M / 86,400 (number of seconds in a day) ≈ 1,150/sec
Read IOPS ≈ 1,150/sec (if we consider that each write will result in a read on the other side)
Adding these two together, the total IOPS ≈ 2,300
Throughput = 500 bytes (per read or write) * 2,300 ≈ 1.15 MB/s
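To make these numbers easy to re-check or tweak, here is the same back-of-envelope arithmetic as a small script; all inputs are the assumptions stated above.

```python
# Back-of-envelope estimates from the numbers above (re-run with your own assumptions).
USERS = 100_000_000        # total users
DAU = 1_000_000            # daily active users
FILES_PER_USER = 10
AVG_FILE_MB = 10
EDITS_PER_DAY = 100        # per active user
METADATA_BYTES = 500       # per file record
EDIT_KB = 2                # average size of an edited page/chunk
SECONDS_PER_DAY = 86_400

file_storage_pb = USERS * FILES_PER_USER * AVG_FILE_MB / 1_000_000_000   # MB -> PB
metadata_gb = USERS * FILES_PER_USER * METADATA_BYTES / 1_000_000_000    # bytes -> GB
daily_bandwidth_gb = DAU * EDITS_PER_DAY * EDIT_KB / 1_000_000           # KB -> GB
write_iops = DAU * EDITS_PER_DAY / SECONDS_PER_DAY
total_iops = 2 * write_iops                                              # one read per write
throughput_mb_s = total_iops * METADATA_BYTES / 1_000_000                # bytes -> MB

print(f"File storage:    {file_storage_pb:.0f} PB")                      # ~10 PB
print(f"Metadata:        {metadata_gb:.0f} GB")                          # ~500 GB
print(f"Daily bandwidth: {daily_bandwidth_gb:.0f} GB in, same out")      # ~200 GB
print(f"IOPS:            ~{write_iops:.0f} writes/s, ~{total_iops:.0f} total/s")
print(f"DB throughput:   {throughput_mb_s:.2f} MB/s")                    # ~1.15 MB/s
```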
Database scaling & database choice
Given the scale we’re operating at, it’s evident that our database requires scaling. In such cases, we typically have two choices: implementing read replicas or utilizing sharding. The key question then becomes: which option should I choose?
Read Replicas: Introducing read replicas can effectively manage the read traffic. However, it doesn’t alleviate the challenge of high write loads, as with read replicas, writes are directed only to a master. While systems like Master-Master can be considered, they tend to be more challenging to maintain.
Sharding: In contrast, sharding is a viable solution for handling both high read and write traffic. If the need arises, we can further scale by adding additional shards. It’s worth noting that compared to read replicas, sharding is generally a more intricate process.
Ultimately, my choice for a scaling strategy will be sharding. This approach addresses my concerns regarding the high load we’re experiencing.
Database choice: I will opt for an SQL database in terms of database type. My primary focus is on achieving consistency, as it’s crucial for the files to be instantly accessible when sharing or querying from a different device, in contrast to the eventual consistency offered by NoSQL databases. Additionally, SQL databases are less intricate in terms of sharding, and there are numerous managed solutions available for this.
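As an illustration of how sharding could be wired up, here is a sketch that routes each user’s metadata to a single shard by hashing the user_id, so per-user queries never have to fan out across shards. The shard count and connection strings are made up for the example.

```python
# Sketch of user_id-based shard routing (shard count and DSNs are illustrative assumptions).
import hashlib

SHARDS = [
    "postgres://db-shard-0.internal/files",
    "postgres://db-shard-1.internal/files",
    "postgres://db-shard-2.internal/files",
    "postgres://db-shard-3.internal/files",
]


def shard_for_user(user_id: str) -> str:
    """Hash the user_id so all of a user's file metadata lands on the same shard."""
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


# Every query for this user goes to the same shard:
# dsn = shard_for_user("user-42")
# conn = psycopg2.connect(dsn)  # any SQL driver works; psycopg2 is just an example
```

A simple modulo scheme like this works as a starting point; in practice, consistent hashing or a shard-mapping table makes it easier to add shards later without moving most of the data.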
Designing the system
Designing the system is the final step, and personally, I consider it to be the least significant. What’s vital for an interviewer is to demonstrate a clear grasp of the system’s requirements and exhibit strong logical reasoning in deriving those requirements and the system’s components. In this final phase, there isn’t a definitive “right” solution; rather, there exist various approaches to implementing your system. This is why the preceding steps hold greater importance and significance compared to this one.
I opted to adopt a design proposed by interviewpen for this purpose.
The architecture is quite simple and can be outlined as follows:
- The user uploads their file via a message broker. This broker’s responsibility is to manage the high volume of file uploads and provide a fault tolerance mechanism in case of any network issues.
- The ingest service first checks whether the user is permitted to upload their file, either by having enough space or by making a payment. If the conditions are met, it sends the file in chunks to the S3 bucket and updates the database accordingly.
- Concerning the download server, its operation is fairly straightforward. It retrieves the data from the database and sends back the corresponding chunks to the client.
It’s worth noting that all of our components are designed to scale (1..n), and the database is also sharded.
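To tie the pieces together, here is a sketch of what an ingest worker in this design might look like: it pulls an upload message off the broker, checks the user’s quota, writes the chunk to S3, and would then update the (sharded) metadata database. The queue URL, bucket, message format, and quota helpers are assumptions for illustration, not the design’s actual code.

```python
# Sketch of an ingest worker (broker, queue URL, message format, and helpers are assumptions).
import base64
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-uploads"  # hypothetical
BUCKET = "my-file-storage"                                                   # hypothetical
FREE_QUOTA_BYTES = 10 * 1024**3  # 10 GB free tier


def quota_for(user_id: str) -> int:
    """Placeholder: a real service would check whether the user has paid for more space."""
    return FREE_QUOTA_BYTES


def used_storage(user_id: str) -> int:
    """Placeholder: a real service would sum the user's stored bytes from the metadata DB."""
    return 0


def handle_upload(message: dict) -> None:
    """Validate quota, store the chunk in S3, then record it in the metadata database."""
    user_id = message["user_id"]
    chunk = base64.b64decode(message["chunk_b64"])  # chunk bytes carried base64-encoded here
    if used_storage(user_id) + len(chunk) > quota_for(user_id):
        return  # quota exceeded; a real service would notify the client
    key = f"{user_id}/{message['file_id']}/{message['chunk_index']}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=chunk)
    # ...update the (sharded) metadata database with the new chunk's key and hash here...


def run_worker() -> None:
    """Long-poll the queue and process upload messages one by one."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            handle_upload(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```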
Areas of improvement
While this architecture is uncomplicated and clear-cut, there are opportunities for refinement.
1. Presigned URLs: Rather than routing chunked uploads through an ingest service to address the bandwidth concern, an alternative is to hand the client a pre-signed URL. This approach shifts the upload bandwidth to the user and introduces an additional layer of security (a sketch follows this list).
2. Security: It’s important to note that security considerations are not addressed here, but they should certainly be a topic of discussion with your interviewer.
3. Caching: Additionally, one might consider implementing a Content Delivery Network (CDN) and integrating a cache layer to enhance user experience and reduce latency.
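For the presigned-URL idea, here is a minimal sketch using boto3: the service only hands out a short-lived upload URL, and the chunk bytes travel directly from the client to S3. The bucket name, key layout, and expiry are assumptions.

```python
# Sketch of issuing a presigned upload URL with boto3 (bucket, key layout, expiry are assumptions).
import boto3

s3 = boto3.client("s3")
BUCKET = "my-file-storage"  # hypothetical


def create_upload_url(user_id: str, file_id: str, chunk_index: int, expires_in: int = 900) -> str:
    """Return a short-lived URL the client can PUT the chunk to directly.

    The chunk bytes go straight from the client to S3, so our servers never carry
    the upload bandwidth; we only record metadata once the client confirms the upload.
    """
    key = f"{user_id}/{file_id}/{chunk_index}"
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=expires_in,
    )
```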
Conclusion
This is the first article in this system design series. I hope you’ve found the content insightful and engaging. Your comments and suggestions for future article topics are greatly encouraged and appreciated. Please feel free to share your thoughts! 🚀🚀🚀