LemonHX


CEO of Limit-LAB. Likes tinkering with low-level code and wants to change the world.

Random Thoughts on LSM🌲


Recently I joined PingCAP, and I have no real understanding of the work I am doing, so I had to catch up on my knowledge of databases and storage. In my opinion, though... not understanding is not necessarily a bad thing. Firstly, there is huge room for growth, and secondly, looking at problems from a different perspective may lead to some new ideas~

Now, let's get to the point. Today, with the limited technical knowledge I have accumulated in the past few days, I will briefly discuss LSM🌲...

I don't know when it started, but a certain company whose homepage gives you a 404 error created Bigtable, and they published a paper about it. The paper proposed a new approach that does away with B-trees, in which writes are more efficient than reads. This data structure, which may not sound reliable at all, actually outperformed traditional databases like MySQL. Unfortunately, Bigtable is closed-source... However! Soon after, the original authors open-sourced a single-node database built on the same algorithm, called LevelDB, and it ignited the whole ecosystem: all kinds of distributed storage systems followed suit. With Facebook's creation of RocksDB (which is the project I am currently working on, and I must admit, my coding skills are not great), the storage engine for a new generation of distributed databases was settled. Our company's TiKV is also built on it. So today, let me show you what this data structure is all about...

So, what does this 🌲 do?#

Firstly, the birth of any technology is constrained by its era. In the era we live in, servers cannot afford SSDs with high random read-write throughput, and there is no storage medium that can sustain high-throughput random reads and writes without degrading. So, in this era, batched sequential reads and writes are more than 20 times as efficient as random ones (this is not made-up data). To write data to a storage medium quickly, you have to read and write sequentially.

Thus, LSM🌲 was born!#

LSM🌲 stands for Log-Structured Merge Tree.

In simple terms, data is written into LSM🌲 just like writing a log, ensuring sequential writing. However, this brings a huge problem. We know that a log is like a linked list, and the complexity of random access in a linked list without an index structure is O(N). How can we use this?! This problem will be addressed when we discuss reading later.

As mentioned earlier, LevelDB by the company with the 404 error uses LSM🌲 in conjunction with SSTable.

SSTable stands for Sorted-String Table.

This is how the company with the 404 error describes it:

An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. 

Therefore, we know that the LSM🌲 in LevelDB is used together with a persistent, ordered, and immutable key-value storage structure, which stores each key together with the offset of its data. With the offset and mmap, we can find the information stored in our LSM🌲 almost instantly. Awesome!
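To make that idea concrete, here is a minimal sketch in Java, assuming a toy index that maps sorted keys to file offsets (SstIndex and findOffset are names I made up for illustration, not LevelDB's actual API): binary search over the sorted keys gives the offset, and in a real engine the data file would be mmap'd so the value can be read straight from that offset.

import java.util.Arrays;

// Toy model of an SSTable index: sorted keys plus the file offset of each value.
// The names SstIndex/findOffset are invented for illustration, not LevelDB's API.
class SstIndex {
    private final String[] sortedKeys; // SSTables are ordered, so keys are sorted
    private final long[] offsets;      // byte offset of each value in the data file

    SstIndex(String[] sortedKeys, long[] offsets) {
        this.sortedKeys = sortedKeys;
        this.offsets = offsets;
    }

    // Binary search the sorted keys: a hit tells us exactly where to read the value
    // in the (mmap'd) data file, a miss returns -1.
    long findOffset(String key) {
        int i = Arrays.binarySearch(sortedKeys, key);
        return i >= 0 ? offsets[i] : -1;
    }
}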

So, the writing process of LSM🌲 is as follows:

  1. Someone tells me they want to write data, so I first append it to a log (the write-ahead log), so nothing is lost if we crash partway through.
  2. The data is then written into the table in memory (it could also be a delete, but since everything is a log, a delete is just writing a tombstone that says "he died").
  3. When the table in memory becomes large enough, a new table is created and the old one becomes read-only.
  4. At an appropriate time, the old table is flushed to the hard disk.
  5. As more and more small tables accumulate on the hard disk, a merge is performed at an appropriate time (such as in the middle of the night) to create a large table. During the merge, data marked "he died" is dropped for good. Since these tables are ordered, the merge is a simple offline merge sort. (A rough code sketch of this write path follows.)
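Roughly, and only as a hedged sketch of the steps above (the class name, walAppend, flushToSst, and the 4 MB threshold are all assumptions I invented for illustration, not how LevelDB or RocksDB actually do it), the write path looks like this:

import java.util.TreeMap;

// Toy write path for the steps above. walAppend, flushToSst and the 4 MB
// threshold are illustrative assumptions, not LevelDB/RocksDB internals.
class ToyLsmWriter {
    private static final long MEMTABLE_LIMIT_BYTES = 4L * 1024 * 1024;
    private static final String TOMBSTONE = "he died"; // the delete marker from step 2

    private TreeMap<String, String> memTable = new TreeMap<>(); // ordered, like an in-memory SSTable
    private long memTableBytes = 0;

    void put(String key, String value) {
        walAppend(key, value);              // step 1: log it first so a crash loses nothing
        memTable.put(key, value);           // step 2: write into the table in memory
        memTableBytes += key.length() + value.length();
        if (memTableBytes > MEMTABLE_LIMIT_BYTES) {
            flushToSst(memTable);           // steps 3-4: freeze the old table and write it to disk
            memTable = new TreeMap<>();
            memTableBytes = 0;
        }
    }

    void delete(String key) {
        put(key, TOMBSTONE);                // a delete is just another write
    }

    private void walAppend(String key, String value) { /* append a record to the log file */ }

    private void flushToSst(TreeMap<String, String> frozen) { /* write a sorted table to disk */ }
}

The point is simply that the disk is only ever touched sequentially: either by appending a record to the log or by flushing a whole sorted table.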

So, the reading process is straightforward:

  1. Check whether the data is in memory. - Very fast
  2. If not, check the recently flushed small tables.
  3. If still not found, check the larger tables.
  4. Repeat, working through larger and larger tables (as the sketch below shows).
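As a sketch in the same toy style (in-memory TreeMaps stand in for the on-disk tables, and all the names are made up), the read path could look like this:

import java.util.List;
import java.util.TreeMap;

// Toy read path for the steps above. In-memory TreeMaps stand in for on-disk
// tables here; the names are purely illustrative.
class ToyLsmReader {
    private final TreeMap<String, String> memTable;
    private final List<TreeMap<String, String>> tablesNewestFirst; // small recent tables before big old ones

    ToyLsmReader(TreeMap<String, String> memTable,
                 List<TreeMap<String, String>> tablesNewestFirst) {
        this.memTable = memTable;
        this.tablesNewestFirst = tablesNewestFirst;
    }

    String get(String key) {
        String v = memTable.get(key);                          // 1. still in memory? very fast
        if (v != null) return v;
        for (TreeMap<String, String> table : tablesNewestFirst) {
            v = table.get(key);                                // 2-3. recent small tables, then larger ones
            if (v != null) return v;
        }
        return null;                                           // 4. we have walked through every table
    }
}

Notice the loop at the end: in the worst case we walk through every single table.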

So... why does it seem like reading is still slow in the worst case, with a complexity of O(N)?!

Now, this is the focus of today's discussion. Take out your notebooks📓, but there won't be a test.

Reading Optimization#

🔴⚫🌲#

Since the author claims to be able to write a red-black tree in 20 lines of Haskell and says that you are not good enough if you can't do it, I won't discuss it.

HashTable#

// This is a recursive function
GOTO 0x00;

Skip List#

Alright, congratulations🎉, you have reached the optimization that most storage engines use, the skip list.

level 2: [11]                [21]                [41]                [70]
level 1: [11]      [14]      [21]      [31]      [41]      [58]      [70]
level 0: [11] [12] [14] [15] [21] [24] [31] [39] [41] [49] [58] [61] [70]

A skip list has multiple levels (on the order of log N of them for N elements). Since our LSM🌲 keeps its data ordered, theoretically we can achieve a read speed close to O(log N).
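To see where that O(log N) comes from, here is a minimal sketch of a skip-list lookup that matches the diagram above; the node layout and the names SkipNode and contains are simplified assumptions, not the memtable code of any particular engine.

// Minimal skip-list node and lookup, matching the diagram above: start at the top
// level, move right while the next key is still <= the target, then drop a level.
class SkipNode {
    final int key;
    final SkipNode[] next; // next[i] is this node's successor at level i

    SkipNode(int key, int levels) {
        this.key = key;
        this.next = new SkipNode[levels];
    }
}

class SkipListSearch {
    // head is a sentinel smaller than every real key and present at all levels.
    static boolean contains(SkipNode head, int topLevel, int target) {
        SkipNode cur = head;
        for (int level = topLevel; level >= 0; level--) {
            while (cur.next[level] != null && cur.next[level].key <= target) {
                cur = cur.next[level]; // skip ahead as far as this level allows
            }
            if (cur.key == target) return true; // found it on the way down
        }
        return false; // fell off the bottom level without a hit
    }
}

Starting from the top, every step either skips over a whole run of lower-level nodes or drops down a level, which is exactly where the logarithmic behaviour comes from.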

If you are a reader of my blog you have probably already sketched something like this in your head, so let's briefly discuss how the skip list decides how many levels a node gets, balancing wasted space against lookup time.

Firstly, in real-life databases, the amount of data being processed is very large, so it is not feasible to construct the skip list by simply adding a level for every two nodes. This would result in a huge waste of space. Therefore, we use a clever probabilistic strategy:

Flipping a Coin

Since the probability of getting heads or tails when flipping a coin is exactly 50%, it perfectly matches the logarithmic nature of the skip list~

import java.util.Random;

// Coin-flip level generator: a node is promoted one level for every consecutive
// "heads", so it reaches level L with probability 1/2^L.
static final Random random = new Random();

static int atWhichLevelDoesThisNodeHaveCache() {
    var level = 0;
    while (true) {
        var randomNumber = random.nextInt();
        if (randomNumber % 2 == 0) {   // heads: promote one more level and flip again
            level += 1;
        } else {
            break;                     // tails: stop, this is the node's top level
        }
    }
    return level;
}

This may seem unfair: some nodes that are accessed more often may still end up without an index at any higher level. But probabilistically, the structure stays quite stable.


Coming soon: I might discuss the SSD optimizations from the Badger project, which AgateDB (a project inspired by the legendary Badger) implements, and how they outperform RocksDB. Stay tuned~
