Dropbox is a cloud-based file storage service for storing and sharing files securely and reliably across devices.
Core functional requirements for the Dropbox system are defined, alongside out-of-scope items.
Users should be able to upload a file from any device.
Users should be able to download a file from any device.
Users should be able to share files with others and view shared files.
Users should be able to automatically sync files across devices.
Out of scope: editing files directly within the system.
Out of scope: viewing files without downloading them first.
Designing Blob Storage itself is outside the scope of this problem, but researching it is suggested.
Key non-functional requirements for the system, including availability, latency, security, and reliability, are outlined.
The system should prioritize availability over consistency.
The system should support files as large as 50GB.
The system should be secure, reliable, and able to recover lost or corrupted files.
Upload, download, and sync times should be as fast as possible.
Out of scope: enforcing a storage limit per user.
Out of scope: supporting file versioning.
Out of scope: scanning files for viruses and malware.
For file storage, prioritizing availability over consistency is acceptable, unlike applications requiring immediate consistency.
A stock trading app requires consistency, meaning a buy transaction must be replicated globally before subsequent buys.
For Dropbox, it is acceptable if an uploaded file takes a few seconds to become visible globally.
The initial setup involves planning the design approach and defining core entities for the system.
The design strategy involves building sequentially through functional requirements, then using non-functional requirements for deep dives.
Defining primary entities early provides a foundation for the system's API and high-level design.
The File entity represents the raw data that users upload, download, and share.
FileMetadata includes information like file name, size, mime type, and uploader.
The User entity represents the system's users.
Defining the API early guides the high-level design, with endpoints for each functional requirement.
An initial endpoint for uploading a file might be POST /files with File and FileMetadata in the request.
An initial endpoint for downloading a file can be GET /files/{fileId} returning File & FileMetadata.
An initial endpoint for sharing a file might be POST /files/{fileId}/share with an array of User IDs.
An endpoint to query changes for syncing can be GET /files/changes?since={timestamp} returning ChangeEvent[].
Each ChangeEvent includes fileId, change type (created, updated, deleted), and updated metadata.
APIs may change or evolve during the design process, which should be communicated to the interviewer.
User authentication information (session token or JWT) should be passed in request headers for security.
Passing user information in the request body should be avoided as it can be manipulated by the client.
The high-level design aims to satisfy all functional requirements first, then layer in non-functional requirements.
Designing how users upload files from any device involves storing file contents and metadata.
File metadata can be stored in a NoSQL database like DynamoDB, which supports loosely structured data.
A basic schema includes id, name, size, mimeType, and uploadedBy fields.
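The basic schema above can be sketched as a small record type. This is only an illustration: the field names come from the text, while the `FileMetadata` dataclass, the sample values, and the choice to serialize it as a plain dict (roughly what a document store item would look like) are assumptions.

```python
from dataclasses import dataclass, asdict


@dataclass
class FileMetadata:
    id: str
    name: str
    size: int        # size in bytes
    mimeType: str
    uploadedBy: str  # ID of the user who uploaded the file


# Hypothetical sample record
meta = FileMetadata(
    id="file-1",
    name="report.pdf",
    size=1_048_576,
    mimeType="application/pdf",
    uploadedBy="user1",
)

# A loosely structured store would persist something like this dict.
item = asdict(meta)
```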
The simplest approach is uploading files directly to a File Service backend server and storing them on its local file system.
This approach has scalability and reliability issues as file numbers grow and server failures occur.
A better approach is storing files in a Blob Storage service (e.g., Amazon S3, Google Cloud Storage) while metadata goes to the database.
Blob Storage handles scaling, offers high reliability, and provides features like lifecycle policies and versioning.
This approach is more complex, requiring integration with Blob Storage and handling transactional consistency between file and metadata uploads.
This approach redundantly uploads files twice: once to the backend and once to Blob Storage.
The best approach allows users to upload files directly to Blob Storage from the client using presigned URLs.
Direct upload is faster and cheaper, bypassing the backend server for file transfer.
Presigned URLs grant temporary permission to upload a file to a specific Blob Storage location.
The upload process becomes a three-step sequence involving requesting a URL, uploading, and notification.
Client requests a presigned URL from the backend, which saves file metadata with 'uploading' status.
Client uses the presigned URL for a PUT request to upload the file directly to Blob Storage.
Blob Storage sends a notification to the backend, which updates file metadata status to 'uploaded'.
Direct upload with presigned URLs is a classic pattern for efficient large file transfers, bypassing application servers for data transfer.
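The three-step sequence above can be sketched with in-memory stand-ins. Everything here is hypothetical: `FileService`, `FakeBlobStore`, the example URL, and the placeholder signature are illustrative names, and a real backend would obtain the presigned URL from the blob storage SDK rather than build a string.

```python
import uuid


class FileService:
    """Hypothetical backend: tracks metadata only, never touches file bytes."""

    def __init__(self):
        self.metadata = {}

    def request_upload(self, name, size, user_id):
        # Step 1: save metadata as 'uploading' and hand back an upload URL.
        file_id = str(uuid.uuid4())
        self.metadata[file_id] = {
            "name": name,
            "size": size,
            "uploadedBy": user_id,
            "status": "uploading",
        }
        url = f"https://blob.example.com/{file_id}?signature=placeholder"
        return file_id, url

    def on_blob_uploaded(self, file_id):
        # Step 3: the storage notification flips the status.
        self.metadata[file_id]["status"] = "uploaded"


class FakeBlobStore:
    """Stand-in for blob storage that notifies the backend after a PUT."""

    def __init__(self, notify):
        self.objects = {}
        self.notify = notify

    def put(self, url, body):
        # Step 2: the client uploads directly; the backend is not in the data path.
        file_id = url.split("/")[-1].split("?")[0]
        self.objects[file_id] = body
        self.notify(file_id)


service = FileService()
blob = FakeBlobStore(notify=service.on_blob_uploaded)

file_id, url = service.request_upload("notes.txt", 5, "user1")
status_before = service.metadata[file_id]["status"]
blob.put(url, b"hello")
status_after = service.metadata[file_id]["status"]
```

The point of the sketch is the division of labor: the backend only ever sees metadata and a notification, while the file bytes go straight to storage.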
Designing how users download files from any device involves several approaches.
The most common solution involves downloading the file first from Blob Storage to the backend, then to the client.
This approach is suboptimal, leading to slower speeds and increased costs due to double downloads.
A better approach is allowing users to download files directly from Blob Storage using presigned URLs.
Client requests a presigned download URL from the backend, then uses it to download the file directly.
While nearly optimal, this approach can be slow for geographically distributed users due to single-region Blob Storage.
The best approach uses a CDN to cache files closer to users, reducing latency and speeding up downloads.
CDNs serve files from the closest server, significantly faster than direct backend or Blob Storage access.
For security, CDN signed URLs provide temporary, permission-based access for file downloads.
CDNs are expensive, requiring strategic caching policies for file caching duration and invalidation.
Cache control headers specify how long files should be cached, optimizing cost and performance.
Cache invalidation removes updated or deleted files from the CDN to ensure fresh content.
To support file sharing, the system needs an efficient mechanism to manage access for other users.
A simple approach is adding a list of users with access (sharelist) directly to the file metadata.
The file metadata schema would include a 'sharelist' field, e.g., 'sharelist': ['user2', 'user3'].
Retrieving files shared *with* a user is slow, requiring scanning every file's sharelist.
A better approach caches an inverse mapping from a user to the files shared with them, in addition to the sharelist.
A cache entry would look like 'user1': ['fileId1', 'fileId2'] for quick lookup.
The main challenge is keeping the cached sharedFiles list in sync with the sharelist in the file metadata.
The best way to overcome sync issues is updating both sharelist and sharedFiles list within a transaction.
Another approach fully normalizes data by creating a new SharedFiles table mapping userId to fileId.
The SharedFiles table has 'userId' (Partition Key) and 'fileId' (Sort Key) forming a composite primary key.
This design removes the need for a 'sharelist' in file metadata and eliminates sync issues.
Looking up the files shared with a user is slightly less efficient with this table, because it relies on an index-based query instead of a simple key-value lookup.
The trade-off of slightly less efficient queries is often worth eliminating the need to sync sharelists.
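A minimal sketch of the normalized SharedFiles table follows. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; the composite primary key plays the role of the partition key plus sort key described above, and the helper names `share_file` and `files_shared_with` are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE SharedFiles (
        userId TEXT NOT NULL,
        fileId TEXT NOT NULL,
        PRIMARY KEY (userId, fileId)  -- composite key: one row per (user, file) share
    )
    """
)


def share_file(file_id, user_ids):
    # INSERT OR IGNORE makes re-sharing with the same user a no-op.
    db.executemany(
        "INSERT OR IGNORE INTO SharedFiles (userId, fileId) VALUES (?, ?)",
        [(user_id, file_id) for user_id in user_ids],
    )


def files_shared_with(user_id):
    # Served by the primary-key index, so no scan over every file is needed.
    rows = db.execute(
        "SELECT fileId FROM SharedFiles WHERE userId = ? ORDER BY fileId",
        (user_id,),
    )
    return [row[0] for row in rows]


share_file("fileId1", ["user1", "user2"])
share_file("fileId2", ["user1"])
share_file("fileId1", ["user1"])  # duplicate share, ignored
```

Because each share is a single row, there is no second copy of the relationship to keep in sync.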
Automatic file synchronization across devices requires handling changes from local to remote and remote to local.
When a user updates a file locally, changes must sync to the remote server, considered the source of truth.
A client-side sync agent monitors local folder changes using OS-specific file system events.
Upon detecting a change, the agent queues the modified file locally for upload.
The agent uses the upload API to send changes and updated metadata to the server.
Conflicts are resolved using a 'last write wins' strategy, saving the most recent edit.
Versioning, though out of scope, would typically add new chunks/files rather than overwriting the only file.
Clients need to detect and pull changes from the remote server to their local devices.
One option is polling: the client periodically queries the server for changes since its last sync, using `updatedAt` timestamps.
Polling is simple but can be slow to detect changes and wastes bandwidth when nothing has changed.
Alternatively, the server maintains an open connection (WebSocket or SSE) with each client and pushes change notifications in real time.
The push approach is more complex but delivers updates in real time.
A hybrid approach combines WebSocket/SSE for real-time updates with periodic polling as a safety net.
The server pushes change events in real-time through a single WebSocket connection per device/session.
Clients periodically poll (e.g., every few minutes) using GET /files/changes?since={timestamp} to catch missed changes.
This approach provides real-time updates and guarantees eventual consistency even with connection interruptions.
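The hybrid approach can be sketched as a client that applies pushed events immediately but only advances its sync cursor when it polls. This is an illustration under assumptions: `Server`, `SyncClient`, `on_push`, and `poll` are hypothetical names, timestamps are small integers, and `get_changes` stands in for GET /files/changes?since={timestamp}.

```python
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    fileId: str
    changeType: str  # "created", "updated", or "deleted"
    updatedAt: int   # server timestamp (integer for simplicity)


class Server:
    def __init__(self):
        self.log = []

    def record(self, event):
        self.log.append(event)

    def get_changes(self, since):
        return [e for e in self.log if e.updatedAt > since]


class SyncClient:
    def __init__(self, server):
        self.server = server
        self.files = set()
        self.last_sync = 0

    def _apply(self, event):
        # Applying an event twice has the same effect as applying it once.
        if event.changeType == "deleted":
            self.files.discard(event.fileId)
        else:
            self.files.add(event.fileId)

    def on_push(self, event):
        # Real-time path: apply, but do not move the cursor,
        # since an earlier push may have been lost.
        self._apply(event)

    def poll(self):
        # Safety net: replay everything since the cursor, then advance it.
        for event in self.server.get_changes(self.last_sync):
            self._apply(event)
            self.last_sync = max(self.last_sync, event.updatedAt)


server = Server()
client = SyncClient(server)
e1 = ChangeEvent("a", "created", 1)
e2 = ChangeEvent("b", "created", 2)
e3 = ChangeEvent("a", "deleted", 3)
for e in (e1, e2, e3):
    server.record(e)

client.on_push(e1)
# The push for e2 is lost, for example during a dropped connection.
client.on_push(e3)
files_after_push = set(client.files)

client.poll()  # catches the missed creation of "b"
```

Keeping the cursor tied to polling is one simple way to make sure a dropped push is eventually replayed.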
The final design brings together the components that satisfy all functional requirements.
Uploader: the client (web, mobile, or desktop app) uploads files and proactively identifies and pushes local changes.
Downloader: the client (potentially the same as the uploader) downloads files and determines when local files need to be updated from the remote.
Load balancer and API gateway: handles request routing, SSL termination, rate limiting, and request validation in front of the application servers.
File service: manages file metadata in the database and generates presigned URLs using the S3 SDK, without handling file contents directly.
File metadata database: stores file metadata (name, size, MIME type, uploader) and the shared files table used to enforce permissions.
S3: stores the actual file contents, with direct uploads made possible by presigned URLs from the file service.
CDN: caches files globally to reduce latency and serves them from the nearest edge location using signed URLs.
On a cache miss, the CDN fetches the file from S3, then serves it from the edge on subsequent requests.
This section explores specific challenges and advanced solutions for the Dropbox system design.
Designing for large files requires addressing user experience and technical limitations of single requests.
Key user experience insights for large files include progress indicators and resumable uploads.
Uploading large files via a single POST request faces several limitations.
Web servers and clients have timeout settings, which a 50GB file upload can easily exceed.
A 50GB file at 100Mbps takes approximately 1.11 hours to upload.
Browsers and web servers, like Amazon API Gateway, often impose strict limits on request payload sizes.
Amazon API Gateway has a hard limit of 10MB for request payloads.
Large files are more susceptible to network interruptions, forcing uploads to restart from scratch.
Users lack progress visibility, not knowing upload status or estimated completion time.
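The 1.11-hour figure for a 50GB file at 100Mbps can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 1000 MB) and ignores protocol overhead and retries; `upload_seconds` is an illustrative helper name.

```python
def upload_seconds(size_gb: float, link_mbps: float) -> float:
    """Ideal transfer time, assuming decimal units and no overhead."""
    size_megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return size_megabits / link_mbps


seconds = upload_seconds(50, 100)  # 400,000 megabits / 100 Mbps
hours = seconds / 3600
```

That works out to 4,000 seconds, or roughly 1.11 hours, which is far longer than typical request timeouts.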
Chunking breaks files into smaller pieces (e.g., 5-10 MB) for individual or parallel uploads.
Chunking must be done on the client side to effectively bypass server payload limitations.
Chunking allows tracking and updating a progress bar for each successfully uploaded chunk, improving UX.
Resumable uploads require tracking uploaded and remaining chunks, saving state in FileMetadata.
The FileMetadata schema includes a 'chunks' field, listing each chunk's ID and status (uploaded, uploading, not-uploaded).
The client uploads chunks to S3, then sends PATCH requests to the backend to update chunk statuses in FileMetadata.
Trusting client-reported statuses is a security risk and can leave metadata in an inconsistent state, since a malicious client could fake chunk upload statuses.
A better approach implements server-side verification of chunk uploads using ETags and S3's ListParts API.
This approach balances user experience with data integrity by accepting client updates but periodically verifying server-side.
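Server-side verification can be sketched as a comparison between what the client reports and what storage says it holds. No real S3 call is made here: `listed_parts` is a hypothetical mapping of part number to ETag that the backend would build from a ListParts response, and the function names and ETag values are made up for illustration.

```python
def verify_part(part_number, reported_etag, listed_parts):
    """True only if storage lists this part with the same ETag the client reported."""
    return listed_parts.get(part_number) == reported_etag


def apply_chunk_update(chunk_status, part_number, reported_etag, listed_parts):
    # Only mark a chunk as uploaded once storage confirms it.
    if verify_part(part_number, reported_etag, listed_parts):
        chunk_status[part_number] = "uploaded"
        return True
    return False


# Parts that storage actually holds (stand-in for a ListParts result)
listed_parts = {1: "etag-aaa", 2: "etag-bbb"}
chunk_status = {1: "uploading", 2: "uploading", 3: "not-uploaded"}

honest = apply_chunk_update(chunk_status, 1, "etag-aaa", listed_parts)
forged = apply_chunk_update(chunk_status, 3, "etag-zzz", listed_parts)
```

A forged update for a part that storage never received is simply rejected, so the metadata cannot drift ahead of the actual upload.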
Resuming uploads requires uniquely identifying files and individual chunks.
A fingerprint (cryptographic hash like SHA-256) identifies file content for deduplication and resumability.
Generating fingerprints for each chunk allows precise identification of transmitted parts for resumable uploads.
The comprehensive process for uploading a large file with chunking and fingerprinting involves multiple steps.
The client chunks the file into 5-10MB pieces, calculating fingerprints for each chunk and the entire file.
Client checks if a file with the same fingerprint exists and is 'uploading' to resume the upload.
If new, client POSTs to initiate a multipart upload; backend gets an S3 uploadId, generates chunk presigned URLs, and saves metadata.
Client uploads each chunk to S3, then PATCHes backend with chunk status and ETag; backend verifies and updates metadata.
Once all chunks are uploaded, backend calls S3's CompleteMultipartUpload API, then updates file metadata to 'uploaded'.
Throughout the process, the client is responsible for tracking upload progress and updating the user interface.
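The client-side part of the steps above (chunking, fingerprinting, and working out what still needs to be sent) might look like the following sketch. It holds the data in memory for brevity, whereas a real client would stream a 50GB file from disk; `plan_upload` and the `already_uploaded` set of chunk fingerprints are assumptions for illustration.

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, the low end of the 5-10MB range


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def plan_upload(data: bytes, already_uploaded, chunk_size: int = CHUNK_SIZE):
    """Return the file fingerprint and the (part_number, fingerprint) pairs still to send."""
    pending = []
    for part_number, chunk in enumerate(split_chunks(data, chunk_size), start=1):
        chunk_fp = fingerprint(chunk)
        if chunk_fp not in already_uploaded:
            pending.append((part_number, chunk_fp))
    return fingerprint(data), pending


# Tiny demo with a 10-byte chunk size so the parts are easy to see.
data = b"a" * 10 + b"b" * 10 + b"c" * 5
done = {fingerprint(b"a" * 10)}  # pretend part 1 was uploaded before an interruption
file_fp, pending = plan_upload(data, done, chunk_size=10)
pending_parts = [part_number for part_number, _ in pending]
```

On resume, only parts 2 and 3 remain in the plan, which is the behavior resumable uploads depend on.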
Cloud storage providers like Amazon S3 offer a Multipart Upload feature that handles large objects in parts.
Candidates are expected to explain S3 Multipart Upload mechanics, not just state its use, to show understanding.
S3 event notifications only trigger when the entire multipart upload is completed, not for individual part uploads.
To track individual part progress, S3's ListParts API can be used, which returns uploaded parts and their ETags.
Chunked downloads are generally not needed as S3 assembles parts into a single object after multipart upload completion.
After assembly, downloads work like any normal file, using a single presigned or CDN signed URL.
For very large files, S3 and HTTP support Range requests, enabling parallel or resumable byte range downloads.
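As a small illustration of Range requests, the helpers below compute the `Range` header values a client could send to fetch a file in parallel parts or to resume after an interruption. The function names are hypothetical; the `bytes=start-end` form with an inclusive end is the standard HTTP syntax.

```python
def byte_ranges(total_size: int, part_size: int):
    """Range header values covering the whole file; end offsets are inclusive."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


def resume_range(bytes_received: int) -> str:
    """Open-ended range asking for everything after what was already received."""
    return f"bytes={bytes_received}-"


parts = byte_ranges(25, 10)
resume = resume_range(12)
```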
Optimizing uploads, downloads, and syncing involves several techniques beyond basic approaches.
CDNs cache files closer to the user, reducing latency and speeding up download times.
Chunking maximizes bandwidth utilization by sending multiple chunks in parallel and adapting chunk sizes to network conditions.
For syncing, chunking allows only changed chunks to be transferred, significantly speeding up the process.
Content-defined chunking (CDC) uses rolling hashes to determine chunk boundaries based on content, making delta sync efficient for small edits.
Fixed-size chunking undermines delta sync when bytes are inserted or deleted, because such an edit shifts all subsequent chunk boundaries.
Systems like Dropbox use Rabin fingerprinting for CDC to achieve efficient delta sync.
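To make the boundary-shifting argument concrete, here is a minimal CDC sketch. It uses a Rabin-Karp style polynomial rolling hash as a simple stand-in, not true Rabin fingerprinting, and it omits the minimum and maximum chunk sizes that practical implementations add. The constants and the `cdc_chunks` name are assumptions chosen for the demo.

```python
import random

WINDOW = 16            # bytes in the rolling window
BASE = 257
MOD = (1 << 31) - 1
MASK = (1 << 6) - 1    # boundary roughly every 64 bytes on random data


def cdc_chunks(data: bytes):
    """Cut a chunk wherever the hash of the last WINDOW bytes has its low bits all zero."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # bring in the newest byte
        if i >= WINDOW - 1 and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
edited = b"X" + original  # insert one byte at the very front

chunks_a = cdc_chunks(original)
chunks_b = cdc_chunks(edited)
edited_set = set(chunks_b)
reused = sum(1 for chunk in chunks_a if chunk in edited_set)
```

Because each boundary depends only on the bytes inside the window, the boundaries move along with the content after the insertion, so every chunk after the first one reappears unchanged in the edited file and would not need to be re-uploaded. With fixed-size chunks, the same one-byte insertion would shift every later chunk.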
Compression reduces file size, meaning fewer bytes need to be transferred, speeding up uploads and downloads.
Compression happens on the client before uploading to S3, and decompression happens on the client after downloading.
Client-side logic should decide whether to compress based on file type, size, and network conditions.
Media files like images and videos have low compression ratios, making compression often not worthwhile.
Text files can achieve high compression ratios, potentially reducing a 5GB file to 1GB or less.
Common compression algorithms include Gzip, Brotli, and Zstandard, each with tradeoffs in ratio and speed.
Gzip is widely used and broadly supported.
Brotli generally offers better compression ratios than Gzip, especially for text, and is supported by modern browsers.
Zstandard provides an excellent balance of speed and compression ratio, compressing and decompressing faster than Gzip.
Zstandard is a strong choice for client-side compression due to its fast compression speed.
Always compress files before encrypting them, as encryption introduces randomness that hinders compression.
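The client-side decision about whether compression is worthwhile can be sketched by simply trying it and measuring the saving. This example uses `zlib` (DEFLATE, the algorithm behind Gzip) because it ships with Python; Brotli and Zstandard bindings would be used the same way. The `maybe_compress` name and the 10% threshold are assumptions, and a real client would likely test a sample rather than the whole file.

```python
import os
import zlib


def maybe_compress(data: bytes, min_saving: float = 0.10):
    """Compress only if it shrinks the payload by at least min_saving."""
    compressed = zlib.compress(data, level=6)
    if len(compressed) <= len(data) * (1 - min_saving):
        return compressed, True
    return data, False


text_like = b"the quick brown fox jumps over the lazy dog\n" * 1000
random_like = os.urandom(20_000)  # behaves like already-compressed media

text_payload, text_compressed = maybe_compress(text_like)
media_payload, media_compressed = maybe_compress(random_like)
```

Repetitive text shrinks dramatically and is sent compressed, while high-entropy data is sent as-is, which mirrors the text-versus-media distinction above.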
Ensuring file security involves encryption in transit, encryption at rest, and robust access control.
Using HTTPS encrypts data transfer between client and server, a standard practice supported by modern browsers.
Encrypting files stored in S3 is a native feature; S3 encrypts files with unique keys stored separately.
The sharelist, or the separate SharedFiles table or cache, serves as the basic Access Control List (ACL).
Download links are generated as signed URLs, valid only for a short period (e.g., 5 minutes).
Signed URLs are bearer tokens, meaning anyone with a valid, unexpired URL can download the file.
A short expiration window limits exposure but does not fully prevent unauthorized sharing.
Signed URLs are generated on the server, incorporating a signature, expiration timestamp, and optional restrictions.
The signed URL is distributed to an authorized user to access the resource directly from the CDN.
CDN verifies the signature, expiration, and restrictions; serves content if valid, denies access otherwise.
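The generate-then-verify flow can be illustrated with a generic HMAC-based scheme. This is not any particular CDN's format (real providers define their own signing schemes and key handling); the secret, the `cdn.example.com` host, and the `sign_url` and `verify_url` helpers are all hypothetical. Time is passed in explicitly so the example is deterministic.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key shared by the backend and the edge


def sign_url(path: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Server side: bind the path and expiry together with a signature."""
    message = f"{path}:{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expires_at, "signature": signature})
    return f"https://cdn.example.com{path}?{query}"


def verify_url(url: str, now: int, secret: bytes = SECRET) -> bool:
    """Edge side: recompute the signature and check the expiry."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    message = f"{parsed.path}:{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected) and now < expires_at


issued_at = 1000
url = sign_url("/files/abc", issued_at + 300)  # valid for 5 minutes

ok = verify_url(url, issued_at + 60)
expired = verify_url(url, issued_at + 301)
wrong_path = verify_url(url.replace("/files/abc", "/files/other"), issued_at + 60)
extended = verify_url(url.replace("expires=1300", "expires=9999"), issued_at + 60)
```

The URL still acts as a bearer token: anyone holding it can download until it expires, but changing the path or the expiry invalidates the signature.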
Expectations for system design interviews vary significantly based on candidate experience level (Mid-level, Senior, Staff+).
This comparison outlines the expected scope and depth of knowledge for Mid-level, Senior, and Staff+ candidates tackling the Dropbox system design problem.
| Level | Breadth vs Depth | Driving/Proactivity | Dropbox Problem Bar |
|---|---|---|---|
| Mid-level (E4) | 80% Breadth / 20% Depth | Drives early, interviewer probes basics and drives later stages | Defines API, data model, functional high-level design; reasons through probing questions. |
| Senior (E5) | 60% Breadth / 40% Depth | Proactive; anticipates challenges and suggests improvements | Moves quickly through the high-level design; discusses large files, multipart upload, and trade-offs in depth. |
| Staff+ (E6+) | 40% Breadth / 60% Depth | Exceptional proactivity; identifies and solves issues independently, interviewer intervenes only to focus the discussion | Dives deep into nuances and practical application of technologies, offers confident solutions from experience, treats interviewer as peer. |