Pastebin System Design

Non-Functional Requirements of the System.

➡️ Durability : The system should be highly reliable. Texts created by users should never be lost.
➡️ High Availability : The system needs to be highly available, meaning it should stay up and keep serving users even when a large number of requests arrive at the same time.
➡️ Low Latency : Users should be able to post text and get it back in a timely manner.


Functional Requirements of the System.

➡️ Paste size should not exceed a fixed limit, say 10MB.
➡️ Custom Path URL
➡️ Paste Expiry : Let's say 6 months.
➡️ User Login / Anonymous


Capacity Estimation

Assumptions

| Estimation | Pastes | Reads | Paste Size |
| --- | --- | --- | --- |
| Per Day | 1M | 10 * 1M | Max 10MB |

As per the table above, the system should handle 1M pastes per day, with up to 10MB per paste. Note also that the system is read heavy, as users access a paste URL more often than they post new pastes.

Traffic

| Writes : 1M per day | Reads : 10 * 1M per day |
| --- | --- |
| ~12 requests per second | ~120 requests per second |

➡️ Along with this you should always keep a buffer of about 30%, meaning the system should be provisioned to handle roughly 30% more requests than the estimated traffic.
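
The traffic figures can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only: it assumes requests are spread evenly over the day (real traffic will have peaks), and the function name `requests_per_second` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def requests_per_second(requests_per_day: int, buffer: float = 0.0) -> float:
    """Average requests per second, optionally padded by a headroom fraction."""
    return requests_per_day / SECONDS_PER_DAY * (1 + buffer)

writes_per_day = 1_000_000
reads_per_day = 10 * writes_per_day  # reads assumed to be 10x writes

write_rps = requests_per_second(writes_per_day)
read_rps = requests_per_second(reads_per_day)

# Same estimates with the 30% buffer applied on top.
write_rps_buffered = requests_per_second(writes_per_day, buffer=0.30)
read_rps_buffered = requests_per_second(reads_per_day, buffer=0.30)

print(f"writes: {write_rps:.1f}/s, with buffer: {write_rps_buffered:.1f}/s")
print(f"reads:  {read_rps:.1f}/s, with buffer: {read_rps_buffered:.1f}/s")
```

With the buffer applied, the targets come out at roughly 15 writes and 150 reads per second.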

Data Estimation

❓How much data will you need to store for your system to run smoothly over a period of time?

| Estimation | Pastes | Paste Size Max | Paste Size Average | Total Data Required - Average | Total Data Required - Max |
| --- | --- | --- | --- | --- | --- |
| Per Day | 1M | 10MB | 100KB | 1M * 100KB = ~100GB | 1M * 10MB = ~10TB |
| Per Month | 1M * 30 | 10MB | 100KB | 1M * 30 * 100KB = ~3TB | 1M * 30 * 10MB = ~300TB |
| Per Year | 1M * 365 | 10MB | 100KB | 1M * 365 * 100KB = ~36.5TB | 1M * 365 * 10MB = ~3,650TB (~3.65PB) |
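
As a cross-check, the storage estimates can be recomputed from the stated assumptions (1M pastes per day, 100KB average, 10MB maximum). The sketch below uses decimal units (1TB = 10^12 bytes); the helper name `storage_tb` is hypothetical.

```python
KB = 10**3
MB = 10**6
TB = 10**12

PASTES_PER_DAY = 1_000_000
AVG_PASTE_BYTES = 100 * KB
MAX_PASTE_BYTES = 10 * MB

def storage_tb(days: int, paste_bytes: int) -> float:
    """Total storage in TB if every paste over `days` days has `paste_bytes` bytes."""
    return PASTES_PER_DAY * days * paste_bytes / TB

for label, days in [("day", 1), ("month", 30), ("year", 365)]:
    avg_tb = storage_tb(days, AVG_PASTE_BYTES)
    max_tb = storage_tb(days, MAX_PASTE_BYTES)
    print(f"per {label}: average {avg_tb:g} TB, worst case {max_tb:g} TB")
```

The worst case assumes every single paste hits the 10MB limit, so it is an upper bound rather than an expected value.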

❓Which database will you use for the system (SQL or NoSQL)?

❓What is the strategy for storing the text for the system?

➡️ We have two sizes to consider for the text: the average (100KB) and the maximum (10MB).
➡️ Storing 100KB of text directly in the database is not a problem and does not put a burden on it.
➡️ Larger pastes need a different home. We can store the text as a Blob in an S3 bucket and keep only a reference to the Blob in our DB.
➡️ On top of this we can take a hybrid approach: keep the first part of the text in the DB and the rest in S3. The DB copy serves as a preview, so the user does not notice any lag while the full file loads in the background.
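
The hybrid approach can be sketched as follows. This is an illustration under stated assumptions, not a production design: two in-memory dictionaries stand in for the database table and the S3 bucket, and the names `PasteStore`, `INLINE_LIMIT_BYTES` and `PREVIEW_BYTES`, along with their threshold values, are invented for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional

INLINE_LIMIT_BYTES = 100 * 1000  # assumed threshold: roughly the average paste size
PREVIEW_BYTES = 1000             # assumed size of the preview kept in the DB row

@dataclass
class PasteRow:
    paste_id: str
    size: int
    content: str             # full text when small, otherwise only a preview
    blob_key: Optional[str]  # object-store key when the full text lives there

class PasteStore:
    def __init__(self) -> None:
        self.rows: Dict[str, PasteRow] = {}  # stand-in for the database table
        self.blobs: Dict[str, bytes] = {}    # stand-in for an S3 bucket

    def save(self, text: str) -> str:
        data = text.encode("utf-8")
        paste_id = uuid.uuid4().hex
        if len(data) <= INLINE_LIMIT_BYTES:
            # Small paste: keep everything in the DB row.
            row = PasteRow(paste_id, len(data), text, None)
        else:
            # Large paste: full text goes to the blob store, preview stays in the DB.
            blob_key = f"pastes/{paste_id}"
            self.blobs[blob_key] = data
            preview = data[:PREVIEW_BYTES].decode("utf-8", errors="ignore")
            row = PasteRow(paste_id, len(data), preview, blob_key)
        self.rows[paste_id] = row
        return paste_id

    def load_preview(self, paste_id: str) -> str:
        """Return what the DB row holds, without touching the blob store."""
        return self.rows[paste_id].content

    def load_full(self, paste_id: str) -> str:
        row = self.rows[paste_id]
        if row.blob_key is None:
            return row.content
        return self.blobs[row.blob_key].decode("utf-8")
```

A page could render `load_preview` immediately and call `load_full` in the background, which is the behaviour described in the last point above.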

❓What is the strategy for caching of the requests?

➡️ As discussed in the earlier point, the hybrid approach helps here too: the preview kept in the DB is small enough to serve quickly and hides the lag of loading the full text.
➡️ We still need to be careful with the caching strategy. Cache memory is limited, so we cannot hold large chunks of data for every user, and anything that does not fit in the cache falls back to the DB and adds load there.
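
One way to keep large pastes from crowding out the cache is an LRU cache bounded by total bytes that refuses oversized entries. The sketch below is an in-process illustration only; the class name `ByteBoundedLRU` and its limits are assumptions, and a real deployment would more likely use a dedicated cache service.

```python
from collections import OrderedDict
from typing import Optional

class ByteBoundedLRU:
    """LRU cache limited by total bytes; entries above max_item_bytes are skipped."""

    def __init__(self, capacity_bytes: int, max_item_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.max_item_bytes = max_item_bytes
        self.used_bytes = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        value = self._items.get(key)
        if value is not None:
            self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: bytes) -> bool:
        if len(value) > self.max_item_bytes:
            return False  # too large to cache; caller reads from DB or blob store
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)
        # Evict least recently used entries until we are back under capacity.
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)
        return True
```

Caching only previews or small pastes keeps the memory footprint predictable, while large pastes are always read from storage.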

❓What is Blob?

➡️ BLOB stands for Binary Large Object, a data type that stores binary data. BLOBs can hold complex files such as text, images or videos, unlike ordinary string columns that store only character data. A BLOB lets multimedia objects be stored in a database; however, not all databases support BLOB storage.

Database Schema

Paste Table

| Column | Type | Description |
| --- | --- | --- |
| id | INT | Primary Key |
| content | TEXT | Text Content |
| size | INT | Size of the text |
| s3BucketURL | TEXT | S3 Bucket URL |
| created_at | DATETIME | Date and Time of creation |
| expires_at | DATETIME | Date and Time of expiry of the Paste |

User Table

| Column | Type |
| --- | --- |
| id | INT |
| name | TEXT |
| created_at | DATETIME |
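
To make the schema concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Several details are assumptions made for the example: the table names `pastes` and `users`, storing timestamps as ISO-8601 UTC text (SQLite has no native DATETIME type), approximating the 6-month expiry as 182 days, and the helper functions `create_paste` and `get_paste`.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional

SCHEMA = """
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE pastes (
    id          INTEGER PRIMARY KEY,
    content     TEXT,
    size        INTEGER NOT NULL,
    s3BucketURL TEXT,
    created_at  TEXT NOT NULL,
    expires_at  TEXT NOT NULL
);
"""

PASTE_TTL = timedelta(days=182)  # assumption: "6 months" approximated as 182 days

def _stamp(moment: datetime) -> str:
    # Fixed-width ISO-8601 strings in UTC compare correctly as plain text.
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")

def create_paste(conn: sqlite3.Connection, content: str, now: datetime) -> int:
    cur = conn.execute(
        "INSERT INTO pastes (content, size, s3BucketURL, created_at, expires_at) "
        "VALUES (?, ?, NULL, ?, ?)",
        (content, len(content.encode("utf-8")), _stamp(now), _stamp(now + PASTE_TTL)),
    )
    return cur.lastrowid

def get_paste(conn: sqlite3.Connection, paste_id: int, now: datetime) -> Optional[str]:
    """Return the paste text, or None if it does not exist or has expired."""
    row = conn.execute(
        "SELECT content FROM pastes WHERE id = ? AND expires_at > ?",
        (paste_id, _stamp(now)),
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
paste_id = create_paste(conn, "hello world", created)
print(get_paste(conn, paste_id, created))                        # still readable
print(get_paste(conn, paste_id, created + timedelta(days=183)))  # past expiry
```

Filtering on `expires_at` at read time hides expired pastes immediately; actually deleting the rows could be left to a periodic cleanup job.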