Leaderboard System Design
Build leaderboard database & system design including redis leaderboard system
Quick Links:
Opening
There is announcement before we start the article. We will regularly release our newsletter weekly in tuesday starting for next week instead of weekend. Make sure you’re not miss out from our newsletter, So don’t forget to subscribe and share this newsletter. If you have specific topic or want to give us feedback, feel free to fill this form: form link
The topic of database and system design is back. One of my favorite phases in an interview is the system design interview. I love this interview because I can assess 75% of the skills in the first 10 minutes. How you design the database, explain things, and ask questions will define your skills.
One of the many problems that I often use for system design interviews is the Leaderboard System Design. A leaderboard system is a feature commonly used in competitive or gamification environments such as video games, sports events, or other activities where individuals or teams compete for rankings based on their performance. Leaderboard systems can be implemented in various ways, for example, ranked based on points earned, win-loss records, or other statistical measures. Another characteristic of a leaderboard system is the display of the top N highest scores, faster completions, or other performance metrics.
The leaderboard use case is interesting because it can be a very simple case and the requirements can grow over time. So, this use case is suitable for almost all engineering levels, from fresh graduates to technical leadership and management. In this article, we are going forward with a practical use case. Let's start with the problem statement.
Problem Statement
Suppose you are asked to build a leaderboard system design for a question-answer website like Quora. Every time a user answers a question, they receive points. Please note that concerns like "what if a user engages in spammy answers" are outside the scope of this discussion.
So, here are the requirements:
Display the top 10 monthly leaderboards on the homepage of the website.
Provide the ability to view past leaderboards (for example, currently in May 2023, users should be able to check the standings from December 2022).
Show the ranking of current authenticated user. If a user's rank is 100 or higher, display "99+" instead of the actual rank. However, if the rank is below 100, display the actual rank.
Please consider the traffic implications. Due to the feature being placed on the homepage, there is a possibility that this feature will experience high levels of traffic.
#1 System Design: Leaderboard with Single Score Table
In my experience, 90% of successful software engineers who answer this question will propose this design in their first attempt. The idea is to have a single leaderboard table that corresponds 1:1 with the source of score data, in this case, the answer table
.
Every time there is an insertion into the answer table, scoring calculation must be performed and the results should be inserted into the leaderboard table.
With this database design schema, obtaining the top ten leaderboards is achieved by simply performing a query with GROUP BY. Let's assume we want to retrieve the top ten leaderboards in May 2023. Here is the query:
-- Notes: The query below is written using MySQL syntax. If you are using PostgreSQL, you may need to adjust the YEAR and MONTH statements accordingly.
SELECT user_id, SUM(score) as total_score
FROM leaderboard
WHERE YEAR(created_at) = 2023
AND MONTH(created_at) = 5
GROUP BY user_id
ORDER BY total_score DESC
LIMIT 100
There is a requirement that if an authenticated user's rank is 100 or higher, we should display "99+". That's why we need to use a LIMIT of 100 instead of 10, as the maximum rank we want to retrieve is 100. If the user is not within the top 100, an additional query will be needed.
SELECT COALESCE(SUM(score), 0) as total_score
FROM leaderboard
WHERE YEAR(created_at) = 2023
AND MONTH(created_at) = 5
AND user_id = 1
While this design is simple, it will encounter scalability issues as the data and traffic scale. For instance, let's consider a scenario where you have 1 million users. Just imagine that every time a user hits your homepage, you would need to query aggregations from 1 million users. While the past month leaderboard can be cached since it remains committed and unchanged, the current month's leaderboard requires real-time updates to reflect changes in rankings. This can lead to performance challenges and potential bottlenecks.
#2 System Design: Leaderboard Score with Cumulative Table
The first solution encountered scalability issues. So, how can we improve it? The root cause of the problem lies in the SUM query, which becomes more resource-intensive as the data grows larger. To address this, we can focus on eliminating the need for the SUM operation. One possible solution is to introduce another table to cumulatively track the leaderboard scores.
In this design, we introduce two tables: leaderboard_detail
, which is the same as in the first solution, and leaderboard_summary
, which is a new table that stores the cumulative score for each year, month, and user_id. That's why we need a UNIQUE index for it.
This database schema may not strictly adhere to the normal forms of database design, but for this particular use case, it works effectively. You might have a question about why we don't just have the cumulative table alone (leaderboard_summary
table)? I have interviewed many engineers, and some of them proposed that design, but they struggled to answer when I asked about “how to ensure data integrity, especially in cases where there may be issues during score insertion”. Therefore, it's necessary to have the details in the leaderboard_detail
table so that we can recalculate the summary/cumulative scores if needed.
With the introduction of the leaderboard_summary
table, there are additional actions required when inserting the score into the database.
So, after the score calculation process, instead of just inserting into the leaderboard_detail
table, we need to also insert into the leaderboard_summary
table. If the entry for a specific year, month, and user_id already exists, we simply need to update the score for that entry.
Now it's easier to retrieve the top ten leaderboards by querying the leaderboard_summary
table. Let's assume we want to retrieve the top ten leaderboards for May 2023. Here is the query:
SELECT *
FROM leaderboard_summary
WHERE year = 2023
AND month = 5
ORDER BY total_score DESC
LIMIT 100
Again, if the authenticated user is not within the top 100, we can simply perform another query like this:
SELECT COALESCE(total_score, 0) as total_score
FROM leaderboard_summary
WHERE year = 2023
AND month = 5
AND user_id = 1
With this design, we eliminate the need for the GROUP BY and SUM statements, making this query much more efficient than the first solution.
#3 System Design: Redis Leaderboard System
The second solution appears to be good and scalable up to a certain point. However, I still don't feel entirely satisfied with the solution. The leaderboard_summary
table feels more like a view table, and I don't believe it needs to be persisted permanently. There are scenarios where the table can be recalculated. While this is a great feature, the idea of deleting all data and refilling it with new data makes me question the need for a relational database management system (RDBMS) like MySQL or PostgreSQL to store such data.
What if we replace the cumulative table with something simpler, like an in-memory database such as Redis?
We can leverage the ZINCR command in Redis to increment a user's score. It works like an upsert operation, meaning we don't need to handle the data if it doesn't exist yet. Redis utilizes sorted sets, represented by the "Z" keyword, which store data as a combination of keys and scores. Think of it as a map[string]number
, and the data is stored in a sorted order. Consequently, when querying the data, we can easily retrieve it in a sorted manner based on the total score.
To retrieve the top 10 leaderboards, we can use the following statements:
// Get Top 10, Rev means descending sort
ZRANGE leaderboard:2023:05 0 9 REV
// In the older redis version we need to use ZREVRANGE
ZREVRANGE leaderboard:2023:05 0 9
// To get authenticated user score
ZREVRANK leaderboard:2023:05 {user_id} WITHSCORE
With Redis, we don't need to retrieve 100 datas (0 9 => means 10 datas), as we can always obtain the rank and scores for each member using the ZRANK or ZREVRANGE statements. In this use case, we can utilize ZREVRANGE since it provides the rank sorted by the total score in descending order.
Based on my rough estimation, each member will consume approximately 80 bytes of memory. Therefore, if we have 1 million users, it would consume around 80MB of memory. With 1GB of RAM, we can store approximately 12 million members. I can confidently say that it is quite safe to store this data in Redis. We just need to scale it vertically if we require more RAM or just delete the old leaderboard that rarely opened.
Summary
The leaderboard system may seem simple at first glance, but it becomes more complex when scalability is required. The design heavily depends on the specific use case, including the amount of data to be stored and the expected traffic. Personally, I find the first solution to be a good starting point for implementing the leaderboard feature. However, as things scale up, I prefer to switch to the third solution, which involves using Redis as the leaderboard system.
Thank you for reading today's newsletter! If you find it valuable, here are some actions you can take:
1) ✉️ Subscribe — if you aren’t already, consider becoming a paid subscriber.
2) ❤️ Share — you can help spread the word by sharing the article with your team or someone who might find it useful!