LSM🌲Go and Badger

Go and Badger#

Last time we talked about how LSM🌲 earned a place in the competitive, B🌲-dominated world of HDD databases thanks to its excellent write performance and, after optimization, acceptable read performance. Newer databases have turned to LSM🌲-based storage backends because of those good read and write characteristics, good concurrency, and other reasons.

Continuing from the previous discussion: on an HDD, random access is roughly 100 times slower than sequential access...

But, there is a revolutionary invention~!

It is the SSD, a storage medium whose price keeps going up and whose lifespan keeps going down!

From the previous article, we learned that LSM🌲 is fast mainly for the following reasons (a toy sketch of this write path follows the list):

  1. New entries go to memory first.
  2. Sequential writes to disk.
  3. Periodic disk compaction.
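
As a rough illustration of the first two points (compaction omitted), here is a toy sketch in Go of that write path: entries land in an in-memory table first and are flushed to disk sequentially, in sorted order, once the table fills up. This is not Badger's or RocksDB's code, just the idea:

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

type toyLSM struct {
	memtable  map[string]string // new writes land here first
	threshold int               // flush once the memtable has this many entries
	flushed   int               // counter used to name SSTable files
}

// Set writes to memory; it only touches disk when the memtable is full.
func (db *toyLSM) Set(key, value string) error {
	db.memtable[key] = value
	if len(db.memtable) >= db.threshold {
		return db.flush()
	}
	return nil
}

// flush writes the memtable to a new file in sorted order, sequentially.
func (db *toyLSM) flush() error {
	keys := make([]string, 0, len(db.memtable))
	for k := range db.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	f, err := os.Create(fmt.Sprintf("sstable-%03d.txt", db.flushed))
	if err != nil {
		return err
	}
	defer f.Close()

	for _, k := range keys {
		if _, err := fmt.Fprintf(f, "%s\t%s\n", k, db.memtable[k]); err != nil {
			return err
		}
	}
	db.flushed++
	db.memtable = make(map[string]string) // start a fresh memtable
	return nil
}

func main() {
	db := &toyLSM{memtable: make(map[string]string), threshold: 2}
	db.Set("a", "1")
	db.Set("b", "2") // this write fills the memtable and triggers a sequential flush
}
```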

The third step determines how fast cold data can be read back from disk, but the cost is that the data must be divided into levels, for example 10 MB for one level and 100 MB for the next. The process of promoting data to a higher level while continuously compacting the smaller levels is called "compaction". For most LSM implementations, the level sizes follow this formula:

sizeof( L+1 ) = 10 * sizeof( L )

When a level can no longer hold the data stored in it, that data must be compacted into the next, larger level; this is "write amplification" (after all, the same data gets written multiple times).

Similarly, when looking up a key, if the key may live on several different levels, it has to be searched for multiple times; this is "read amplification" (after all, the same data gets read multiple times).

Therefore... the room for optimization lies in using the SSD's efficient random reads to optimize LSM🌲 and reduce both read amplification and write amplification.

However, write amplification significantly shortens an SSD's lifespan, so we need to reduce write amplification while taking advantage of its efficient random reads.

The Emergence of Badger#

Founder: I am a Go user and I want to write a DB. I looked at RocksDB; it's okay, but why do I have to use CGo? Once I use CGo, Goroutines get slowed down by pthreads and turn into 💩. Oh, and why are there still memory leaks?! Damn C (I agree with this), I'll just rewrite it in Go (I agree with this too; after all, the RocksDB developers don't even know how to write CMake).

Badger only provides the following functions:

Get, Set, Delete, and Iterate. On top of these, it adds the atomic operations CompareAndSet and CompareAndDelete.

Compared with RocksDB, Badger considered transactions from the very beginning, which makes writing to the database simpler.
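
As a quick illustration, here is a minimal sketch of opening a Badger store and doing a Set/Get. It assumes the current transactional Go API (github.com/dgraph-io/badger/v3), which wraps the primitives listed above inside Update/View transactions rather than exposing them directly:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Open a store; Keys live in the LSM tree, Values in the value log.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a key inside a read-write transaction.
	if err := db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("answer"), []byte("42"))
	}); err != nil {
		log.Fatal(err)
	}

	// Read it back inside a read-only transaction.
	if err := db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("answer"))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			log.Printf("answer = %s", val)
			return nil
		})
	}); err != nil {
		log.Fatal(err)
	}
}
```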

Differences between Badger and LSM🌲 on HDD#

Badger separates the KV pairs: Keys are stored in an LSM🌲, while Values are stored in a Write-Ahead Log (WAL). This shrinks the LSM🌲 to half the size of RocksDB's, reducing the performance loss from write amplification. The second point is that it fetches the Value directly from the WAL using the SSD's random read capability.

Implementation Details#

Separate Storage for KV#

The main cost of a classic LSM🌲 is "compaction", which involves sorting data offline and iterating over it. But usually only the Keys need to be sorted; the Values do not change during this process. So Badger stores Keys and Values separately, saving the memory that would otherwise be wasted sorting a large amount of Value data that never moves.
Secondly, Badger stores the location of each Value as a compact pointer next to its Key, which tells the SSD exactly where the corresponding Value lives. Limiting the size of individual files also improves cache performance: assuming a Key is 16 bytes and a pointer is 16 bytes, a single 64 MB file can hold more than 2 million rows of KV.
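
To make that layout concrete, here is a minimal sketch of the idea; the type and field names are hypothetical illustrations, not Badger's internal definitions:

```go
package main

import "fmt"

// valuePointer records where a Value lives in the value log (WAL), so the
// LSM tree only has to carry the Key plus this small, fixed-size pointer.
type valuePointer struct {
	FileID uint32 // which value-log file
	Offset uint32 // byte offset of the entry inside that file
	Len    uint32 // length of the entry, so it can be read back in one call
}

// lsmEntry is what actually gets sorted and compacted in the LSM tree:
// the Key and the pointer, never the (possibly large) Value itself.
type lsmEntry struct {
	Key     []byte // ~16 bytes in the example above
	Pointer valuePointer
}

func main() {
	// With ~16-byte Keys and ~16-byte pointers, a 64 MB file holds
	// 64*1024*1024 / 32 = 2,097,152 entries, i.e. over 2 million rows.
	fmt.Println(64 * 1024 * 1024 / 32)
}
```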

Reducing Write Amplification#

Because Values are no longer stored in the LSM🌲, they contribute nothing to write amplification and never move around with the Keys during compaction.
If we assume the Value size is 1 KB and the Key size is still 16 bytes, then the write amplification is only

(10*16 + 1024)/(16 + 1024) ~= 1.14

which means writes are only slightly amplified.
This shows up in the performance numbers: writes are N times faster than RocksDB, where N depends on how long it takes the Value to be written to disk.
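
For comparison, the same back-of-the-envelope arithmetic can be written out in Go, contrasting the key-only LSM with a classic LSM that rewrites whole KV pairs at every level; the factor of 10 and the 16-byte / 1 KB sizes are simply the numbers used above:

```go
package main

import "fmt"

func main() {
	const (
		keySize   = 16.0   // bytes per Key (pointer overhead ignored here)
		valueSize = 1024.0 // bytes per Value
		levels    = 10.0   // how many times compaction rewrites a Key
	)

	// Key-only LSM: Keys are rewritten at every level, the Value is written once to the log.
	separated := (levels*keySize + valueSize) / (keySize + valueSize)

	// Classic LSM: the whole KV pair is rewritten at every level.
	classic := levels * (keySize + valueSize) / (keySize + valueSize)

	fmt.Printf("separated: %.2f, classic: %.0f\n", separated, classic) // ~1.14 vs 10
}
```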

Reducing Read Amplification#

Because the Value is not stored in the LSM, the LSM becomes much smaller. That also means fewer levels, and more of the data fits in memory. Benchmarking showed that with a fixed 1 KB value size and 22 million rows, RocksDB's LSM tree held a 72 GB data set, while Badger, storing only Keys and offsets, needed just 1.7 GB, all of it in memory...

Therefore, the read speed is 3.5 times faster than RocksDB...
The iteration speed... is 3.5*N.

The benchmark results are as follows#

[Figure: badger-benchmarks.png]

Due to limited space#

For more information, please refer to

Next time... maybe we will interview Lao Chi to see what advanced features our Agate has.
