Although I've never used Riak, I've been a distant fan just because it's written in Erlang. Erlang, systems that never stop! In one of Basho's whitepapers they mention the use of the Log-structured Merge Tree (LSM-tree) data structure for fast indexing. So what's an LSM-tree? It's a "disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period." Riak is often used in write-heavy environments, so it's important that indexing is fast. So what's an LSM-tree again? Hmmm, the LSM-tree is inspired by the Log-structured File system (LSF), so I'd better first learn a little more about LSF.
The driving force behind Ousterhout and Rosenblum's Log-structured File system was (is) the mechanical limitations of the disk drive. Unlike the processor or memory, disk drives have mechanical moving parts and are governed by the laws of Newtonian physics. To read or write to disk, the arm first has to move to the desired track, then there's a rotational delay until the disk spins to the relevant sector. This access time is in the milliseconds, which is an eternity compared to memory speeds or processor cycles. The access time overhead is exacerbated when the workload is frequent, small reads and writes: more (relative) time is spent moving the disk head around than on actual data transfer.
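To put rough numbers on this, here's a quick back-of-the-envelope sketch (the seek, rotational delay, and transfer figures below are assumed, ballpark values for illustration, not measurements of any particular drive):

```python
# Rough model: time to service a single write = access time + transfer time.
SEEK_MS = 8.0          # assumed average seek time
ROTATION_MS = 4.0      # assumed average rotational delay (half a revolution)
TRANSFER_MB_S = 100.0  # assumed sequential transfer rate

def write_cost_ms(size_kb):
    access = SEEK_MS + ROTATION_MS
    transfer = (size_kb / 1024.0) / TRANSFER_MB_S * 1000.0
    return access, transfer

for size_kb in (4, 64, 1024):
    access, transfer = write_cost_ms(size_kb)
    print(f"{size_kb:5} KB write: {access:.0f} ms access, {transfer:.2f} ms transfer "
          f"({transfer / (access + transfer):.1%} of the time doing useful work)")
```

With numbers like these, a 4 KB write spends well under 1% of its time actually transferring data; the rest is the head getting into position.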
[Aside. Slow disk drives are one of the reasons I prefer to develop on desktops and not laptops. You get a fancy new MacBook Pro with the latest processor and a shitload of RAM only to be bound by I/O. Money is better spent on the fastest disk drive you can buy.]
The situation for reads is "easily" solved with a file cache. More memory, bigger caches, better hit rates, fewer read requests that have to go to disk. But more memory does not help as much with writes. File systems can buffer more writes in memory before flushing to disk, but the flushes still need to be frequent to avoid data loss; and the writes still involve accessing random parts of the disk.
To see this clearly, below is a diagram of a traditional Unix File System writing two single-block files in two directories.
The Unix FS involves 8 random, non-sequential writes (numbered, but not in that order): 4 to inodes and 4 to data blocks (2 directories, 2 files). Half of these are synchronous writes to avoid leaving the file system in an inconsistent state. The other half can be done with asynchronous delayed write-back. Newer file systems have many optimizations to help with performance, like keeping inodes and data blocks closer together, but the point remains that these types of file systems suffer from the limitations of disk access time.
Ousterhout and Rosenblum's log-structured file system gets around this by avoiding random, non-sequential writes altogether. Writes are done asynchronously in large sequential transfers. This minimizes the access time latency and allows the file system to operate closer to the disk's maximum throughput rate. As the diagram shows, the same information is written to disk: 4 inodes and 4 data blocks (2 directories, 2 files). But it's written sequentially by appending to the log. Data (both metadata like inodes and the actual file data) is never overwritten in place, just appended to the log.
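To make the append-only idea concrete, here's a toy sketch in Python (nothing like the real on-disk layout, just the shape of the idea): both file data and inodes are appended to the tail of a single log, and each append simply returns the offset where the record landed.

```python
# Toy append-only log: every write, data or metadata, goes to the tail.
class Log:
    def __init__(self):
        self.blocks = []                  # stands in for the on-disk log

    def append(self, record):
        self.blocks.append(record)
        return len(self.blocks) - 1       # "disk address" of the new record

log = Log()
data_off = log.append({"type": "data", "bytes": b"hello"})
inode_off = log.append({"type": "inode", "ino": 123, "data_blocks": [data_off]})
# Updating the file later never touches these records: a new data block and a
# new inode are appended, and the old versions become garbage in the log.
```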
This is clever and all, but how do we get the data back?!? In the traditional Unix FS the inodes are at fixed locations. Given inode number 123 it's easy to calculate its disk location with a little math, and once we have the inode location we can get the data blocks. This doesn't work with LSF since inodes are not fixed: they're appended to the log just like the data blocks. Easy enough, create an inode map that maps inode numbers to their locations. Wait a second, how can we then find the location of the inode maps? Finally, it's time to write to a fixed location: the checkpoint region.
The checkpoint region knows the location of the active inode maps. At startup we read in the checkpoint region, load the locations of the inode maps into memory, then load the inode maps into memory. From then on, it's all in-memory. The checkpoint region is periodically written to disk (checkpointed). Once we have the inode maps, read requests behave much like in the traditional Unix FS: look up the inode, perform access control, get the data blocks.
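Continuing the toy sketch from above (the record layout and the name CHECKPOINT_REGION are made up for illustration), the read path just chains through the indirections:

```python
# Append a toy inode map to the log and record its location in the one
# fixed place on disk, the checkpoint region.
inode_map_off = log.append({"type": "inode_map", "map": {123: inode_off}})
CHECKPOINT_REGION = {"inode_map_offset": inode_map_off}

def read_file(log, ino):
    # 1. Checkpoint region -> location of the inode map (read once at startup,
    #    then kept in memory).
    inode_map = log.blocks[CHECKPOINT_REGION["inode_map_offset"]]["map"]
    # 2. Inode map -> location of the inode in the log.
    inode = log.blocks[inode_map[ino]]
    # 3. Inode -> data block locations, exactly as in a traditional FS.
    return [log.blocks[off]["bytes"] for off in inode["data_blocks"]]

print(read_file(log, 123))   # [b'hello']
```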
In summary, read requests don't change much and we can leverage the file cache to improve performance. Write requests, however, show dramatic improvements, especially frequent, small ones, since we always write sequentially in large chunks.
But the story doesn't end quite yet. If we always append and never overwrite in place, we will eventually run out of space unless we can reclaim free space. Reclaiming free space sounds like memory garbage collection in programming languages; and that's exactly what LSF does: garbage collect.
Imagine that segments 5 and 6 have both live and dead blocks (files that have been deleted). The segment cleaner (garbage collector) can compact segments 5 and 6 by copying only the live blocks into an available free segment. Each segment has a segment summary block (not shown) with information about itself to help in this process (which blocks are dead, etc.). Then it's just a matter of moving the links in the segment linked list to restore the order. I'm of course hand-waving here, as things are more involved. As with memory garbage collection, the details and the optimizations determine whether the system performs well. Issues like garbage collecting long-lived objects (data), when to run the collector, etc. emerge.
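In the same toy spirit, here's roughly what the cleaner's copy-forward step looks like (the is_live callback stands in for the segment summary bookkeeping, and updating the inode map to point at the moved blocks' new locations is elided):

```python
# Compact the victim segments by copying their live blocks into a free
# segment, then mark the victims as empty and reusable.
def clean(segments, victim_ids, free_id, is_live):
    for seg_id in victim_ids:
        live_blocks = [blk for blk in segments[seg_id] if is_live(blk)]
        segments[free_id].extend(live_blocks)
        segments[seg_id] = []             # victim segment can now be reused
    return segments

# e.g. compact segments 5 and 6 into free segment 9:
# clean(segments, victim_ids=[5, 6], free_id=9, is_live=lambda blk: not blk["dead"])
```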
There you have it, the Log-structured File system. Next time, the Log-structured Merge Tree.