TinyKV

TinyKV is a simple, lightweight key-value store.

A friend of mine implemented an LSM-tree-based key-value store; I thought it was cool and decided to give it a go as well. (Check out his blog post here.)

Considerations

After deciding to implement a key-value store, the next step was to figure out how I would implement persistence. There are two common approaches:

1) Storing on disk as B+ trees

  • Key idea: key-value pairs are indexed by a B+ tree, with each leaf node pointing to the actual entry on disk (or storing the entry inline, given the relatively small size of each entry).
  • Good for read-heavy workloads, as we only need O(log n) disk reads to locate an entry.
  • Suffers from write amplification (extra disk I/O for every logical write) because updates are made in place. For instance, to update or delete an entry, we have to traverse the tree, locate the entry, then rewrite the affected node.

2) Storing on disk as SSTables (Sorted String Tables)

  • Key idea: key-value pairs are first written to an in-memory MemTable, which keeps entries in sorted order. When the MemTable is full, it is flushed to disk as an SSTable, which stores those entries in sorted order. To locate an item in an SSTable, we can simply perform a binary search, finding an entry in O(log n) time (see the lookup sketch after this list).
  • Leveled Compaction: new SSTables are written to Level 0 of our LSM tree, and entries get older as we move down the levels. A compaction mechanism merges entries across levels so that newer entries overwrite older ones.
    • NOTE: the key ranges of the SSTables in Level 0 may overlap, but SSTables within each lower level have disjoint key ranges. This narrows the search space when searching those levels, as we can binary-search over the tables' key ranges to find the one SSTable that could contain our search key.
    • (insert diagram)
  • Good for write-heavy workloads, as new writes are append-only. Updates to existing keys are simply written as new entries and persisted in a new SSTable as-is, making writes very fast.
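
To make the SSTable lookup concrete, here is a minimal sketch (not TinyKV's actual code) of a binary search over one SSTable, assuming its entries have already been loaded into memory as a list of (key, value) pairs sorted by key.

```python
from bisect import bisect_left

def sstable_get(entries, key):
    """Binary-search a list of (key, value) pairs sorted by key: O(log n)."""
    i = bisect_left(entries, (key,))  # first index whose key is >= the search key
    if i < len(entries) and entries[i][0] == key:
        return entries[i][1]
    return None  # key not present in this SSTable

# Usage: entries must already be sorted by key, e.g. after a MemTable flush.
table = [("apple", "1"), ("banana", "2"), ("cherry", "3")]
assert sstable_get(table, "banana") == "2"
assert sstable_get(table, "durian") is None
```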

Naive Implementation

After reading up on the above approaches, the full LSM design still felt a little too complex for me to implement right away. To kick things off, I implemented a naive, end-to-end, persistent key-value store (a minimal sketch follows the component list). It consisted of a few key components:

  1. In-memory write buffer - Enables fast reads for the most recent writes. When full, its entries are flushed to disk.
  2. DiskManager - Writes entries to disk. For this implementation, I did not sort the data; I simply flushed the entries as they were from the write buffer to a new file on disk.
  3. Entry - consists of { key: string, val: string } (NOTE: no tombstoning yet).
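
For illustration, here is a minimal sketch of this naive design. The NaiveKV class, its JSON-lines file format, and the fixed buffer capacity are all assumptions for the sketch, not the actual TinyKV code, and it omits the flush-time overwriting of existing on-disk keys discussed in the analysis below.

```python
import json
import os
import uuid

class NaiveKV:
    """Naive persistent store: an in-memory buffer plus unsorted flush files."""

    def __init__(self, data_dir, capacity=1024):
        self.data_dir = data_dir
        self.capacity = capacity
        self.buffer = {}                     # in-memory write buffer
        os.makedirs(data_dir, exist_ok=True)

    def put(self, key, val):
        self.buffer[key] = val               # O(1): just write to the buffer
        if len(self.buffer) >= self.capacity:
            self.flush()

    def get(self, key):
        if key in self.buffer:               # fast path: most recent writes
            return self.buffer[key]
        # Slow path: scan every file on disk, O(n) over all entries.
        for name in os.listdir(self.data_dir):
            with open(os.path.join(self.data_dir, name)) as f:
                for line in f:
                    entry = json.loads(line)
                    if entry["key"] == key:
                        return entry["val"]
        return None

    def flush(self):
        # Dump the buffer as-is (unsorted) into a brand-new file on disk.
        path = os.path.join(self.data_dir, f"{uuid.uuid4().hex}.log")
        with open(path, "w") as f:
            for key, val in self.buffer.items():
                f.write(json.dumps({"key": key, "val": val}) + "\n")
        self.buffer.clear()
```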

Analysis

  1. PUT operation - O(1), simply write to the in-memory write buffer.
  2. GET operation - O(n). In the worst case, we'd have to look through every file on disk to find the key.
  3. DELETE operation - O(n).
    • Since I had no tombstoning, a delete had to wipe out the data completely: remove the key from the write buffer if it exists, then remove it from disk. This process is super expensive and doesn't make much sense (why delete from both the in-memory component and from disk, instead of just marking the entry as deleted in the in-memory write buffer?).
  4. FLUSH operation - O(n)
    • Very expensive, because for every key in the write buffer I had to check whether it already existed in some file on disk and overwrite it if it did.

TinyKV v1

Now that I had a rough idea of the end-to-end flow, it was time to implement the more complex LSM-tree-based key-value store.

Key Components

  1. Entry - Represents a key-value pair, together with its tombstone status.
  2. MemTable - This is an in-memory component (currently implemented with an ordered map), storing the key-value pairs in sorted order.
  3. WAL - A Write-Ahead Log: an append-only log that we write to each time we insert or delete a key-value pair, so the MemTable can be rebuilt after a crash.
  4. SSTableManager - Manages all levels in our LSM Tree.
  5. LevelManager - Manages a particular level in our LSM Tree, consisting of multiple SSTables.
  6. SSTableFileManager - Manages a particular SSTable file, including writing to and reading from it.
  7. BloomFilter - Essential for minimizing disk reads. The Bloom filter is a probabilistic data structure: if it reports that an entry is absent from a particular SSTable, the entry definitely does not exist there. False positives are possible, but never false negatives. This greatly reduces disk reads, as we can skip any table whose Bloom filter returns false (a minimal sketch follows this list).
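
As an illustration of the Bloom filter idea (not TinyKV's actual implementation), here is a minimal sketch assuming a fixed-size bit array and k bit positions derived from salted MD5 digests.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: false positives possible, false negatives impossible."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive k bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False => the key is definitely not in the SSTable (skip the disk read).
        # True  => the key *may* be present; we still have to check the SSTable.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("apple")
assert bf.might_contain("apple")      # never a false negative
print(bf.might_contain("durian"))     # usually False; occasionally a false positive
```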

Challenges

A key challenge I faced was figuring out how to implement compaction. The idea sounded simple enough, but implementing it was rather involved. A few questions I needed to address:

  1. How would I organize files on disk to reflect each "level"?
  2. How would I merge files in Level 0 (which contain overlapping keys) versus in the other levels? (A simplified merge sketch follows these questions.)
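
To make the second question concrete, here is a simplified merge sketch (not TinyKV's actual compaction). It takes sorted runs ordered from newest to oldest, lets newer entries win, and can drop tombstoned keys when merging into the bottom level; the (key, value, tombstone) entry format is an assumption for the sketch.

```python
def merge_sstables(tables, drop_tombstones=False):
    """Merge sorted runs of (key, value, tombstone) tuples, newest table first."""
    merged = {}
    # Walk from oldest to newest so that newer entries overwrite older ones.
    for table in reversed(tables):
        for key, value, tombstone in table:
            merged[key] = (value, tombstone)
    result = []
    for key in sorted(merged):
        value, tombstone = merged[key]
        if drop_tombstones and tombstone:
            continue  # deleted entries can be discarded at the bottom level
        result.append((key, value, tombstone))
    return result

newer = [("a", "1", False), ("b", None, True)]   # "b" was deleted
older = [("a", "0", False), ("c", "3", False)]
print(merge_sstables([newer, older]))
# [('a', '1', False), ('b', None, True), ('c', '3', False)]
print(merge_sstables([newer, older], drop_tombstones=True))
# [('a', '1', False), ('c', '3', False)]
```

A real compaction would stream the inputs with a k-way merge rather than materializing everything in a dict, but the overwrite-newest and drop-tombstone logic is the same idea.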

Things I learned

  • Understood how a real key-value store works under the hood, and how a write-optimized key-value store is actually implemented.
  • Gained some insight into how to model a problem with OOP, and then implement a working system from that model.

Future Enhancements

  1. Write tests to ensure robust code quality.
  2. Support concurrent writes - different threads calling DB.put(key, val) or DB.del(key, val) should run successfully without data races (e.g. concurrent access to the same entry in the MemTable) or race conditions (e.g. thread 1 overwriting data written by thread 2). A minimal locking sketch follows.
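
As a rough sketch of one possible direction (a hypothetical DB wrapper, not TinyKV's actual API), a single coarse lock around the MemTable already rules out the races above; since del is a reserved word in Python, the sketch names the method delete.

```python
import threading

class DB:
    """Hypothetical wrapper: one coarse lock serializes all MemTable mutations."""

    def __init__(self):
        self._lock = threading.Lock()
        self._memtable = {}

    def put(self, key, val):
        with self._lock:                 # prevents data races on the MemTable
            self._memtable[key] = val

    def delete(self, key):
        with self._lock:
            self._memtable[key] = None   # tombstone instead of removing the key
```

A coarse lock serializes every write; finer-grained approaches (sharding the MemTable, or using a concurrent ordered map) trade extra complexity for higher throughput.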