Timilearning

Chapter 3 - Storage and Retrieval

These are my notes from the third chapter of Martin Kleppmann's Designing Data-Intensive Applications.

This chapter arms us with knowledge of how databases work under the hood.

There's a difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics.

Storage Engines

There are two families of storage engines: log-structured storage engines (e.g. log-structured merge-trees) and page-oriented storage engines (e.g. B-trees). A storage engine's job is to write things to disk on a single node.

Log-Structured Storage Engines

Many databases internally use a log: an append-only data file to which records are added. Each line in the log contains a key-value pair, separated by a comma (similar to a CSV file, ignoring escaping issues). The log does not have to be human-readable; it might be binary and intended only for other programs to read.

Indexing

An index is an additional structure derived from the primary data. Any kind of index usually slows down writes, since the index has to be updated every time data is written.

Well-chosen indexes speed up read queries, but every index slows down writes.

Hash Index

These are indexes for key-value data. For data storage that consists only of appending to a file, a simple indexing strategy is to keep an in-memory hash map in which every key is mapped to the byte offset in the file at which its latest value was written. To read a value, you seek to that offset.

Bitcask (the default storage engine in Riak - Riak is a distributed datastore similar to Cassandra) uses the approach above. The only requirement is that all the keys fit in the available RAM as the hash map is kept completely in memory. The values don't have to fit in memory since they can be loaded from disk with a simple disk seek. Something like Bitcask is suitable for situations where the value for a key is updated frequently.

The obvious challenge in appending to a file is that the file can grow too large and then we run out of disk space. A solution to this is to break the log into segments of a certain size. A segment file is closed when it reaches that size, and subsequent writes are made to a new segment.

We can then perform compaction on these segments. Compaction means keeping the most recent update for each key and throwing away duplicate keys. Compaction often makes segments smaller (relies on the assumption that a key is overwritten several times on average within one segment), and so we can merge several segments together at the same time as performing the compaction.

Basically, we compact and merge segment files together. The merged segment is written to a new file. This can happen as a background process, so the old segment files can still serve read and write requests, until the merging process is complete.
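The compact-and-merge step can be sketched like this (segments here are just in-memory lists of pairs standing in for segment files, ordered oldest first):

```python
# Merge several segments, keeping only the most recent value per key.
def compact_and_merge(segments):
    merged = {}
    for segment in segments:        # iterate oldest segment first...
        for key, value in segment:
            merged[key] = value     # ...so newer writes overwrite older ones
    return list(merged.items())
```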

Each segment will have its own in-memory hash table. To find a value for a key, we'll check the most recent segment. If it's not there, we'll check the second-most-recent segment, and so on.

There are certain practical issues that must be considered in a real-life implementation of this hash index over a log. Some of them are:

  1. File format: CSV is not the best format for a log. It's faster and simpler to use a binary format that encodes the length of a string in bytes, followed by the raw string.
  2. Deleting records: to delete a key, append a special deletion record (sometimes called a tombstone). When segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
  3. Crash recovery: if the database is restarted, the in-memory hash maps are lost. They can be rebuilt by reading each segment file from beginning to end, or restored more quickly from snapshots of the hash maps stored on disk.
  4. Partially written records: the database may crash halfway through appending a record. Checksums allow such corrupted records to be detected and ignored.
  5. Concurrency control: since writes are appended in a strictly sequential order, a common implementation choice is to have only one writer thread.

There are good reasons why an append-only log is a good choice, as opposed to a storage where files are updated in place, overwriting the old value with the new value. Some of those reasons are:

  1. Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives.
  2. Concurrency and crash recovery are much simpler if segment files are append-only or immutable: you don't have to worry about a crash leaving a file with half the old and half the new value spliced together.
  3. Merging old segments avoids the problem of data files getting fragmented over time.

Basically, when data gets scattered across the disk over time, that's fragmentation; merging old segments avoids it.

There are limitations to the hash table index though. Some of them are:

  1. The hash table must fit in memory, so this approach isn't suitable for a very large number of keys.
  2. Range queries are inefficient: you can't easily scan over all keys between, say, kitty00000 and kitty99999 without looking up each key individually.

So, in this approach, writes are appended to segments on disk, while the hash table index is kept in memory.

SSTables and LSM-Trees

In log segments with hash indexes, each key-value pair appears in the order that it was written, and values later in the log take precedence over values for the same key earlier in the log. Apart from that, the order of key-value pairs in the file is irrelevant.

The Sorted String Table format, or SSTable for short, changes this approach: it requires that the sequence of key-value pairs in each segment file is sorted by key. Hence, new key-value pairs cannot simply be appended to the segment immediately. There are several advantages to this over log segments with hash indexes:

  1. Merging segments is simple and efficient, even if the files are bigger than the available memory; the approach is like the merge step in mergesort.
  2. You no longer need to keep an index of all the keys in memory: a sparse index with one key every few kilobytes is enough, since you can scan between two indexed keys.
  3. Since read requests need to scan over several key-value pairs in a range anyway, records can be grouped into blocks and compressed before being written to disk, saving disk space and I/O bandwidth.

You still need an in-memory index to tell you the offsets of some keys, but it can be sparse.

SSTables store their keys in blocks and have an internal index, so even though a single SSTable may be very large (gigabytes in size), only the index and the relevant block need to be loaded into memory.
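A sketch of a sparse index over a sorted segment, assuming the segment is just a Python list of (key, value) pairs standing in for the on-disk file; only every Nth key is indexed, and a lookup scans forward from the nearest indexed position:

```python
import bisect

# Build a sparse index: record the position of only every Nth key.
def build_sparse_index(records, every=3):
    return [(records[i][0], i) for i in range(0, len(records), every)]

def sstable_get(records, sparse_index, key):
    keys = [k for k, _ in sparse_index]
    pos = bisect.bisect_right(keys, key) - 1   # last indexed key <= key
    if pos < 0:
        return None
    start = sparse_index[pos][1]
    for k, v in records[start:]:               # short forward scan
        if k == key:
            return v
        if k > key:            # keys are sorted, so we can stop early
            return None
    return None
```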

Constructing and maintaining SSTables

It's possible to maintain a sorted structure on disk (see B-trees), but maintaining it in memory is easier and is the approach described here. We can use well-known tree data structures such as red-black trees or AVL trees, into which keys can be inserted in any order and read back in sorted order.

So the storage engine works as follows:

  1. When a write comes in, add it to the in-memory balanced tree (e.g. a red-black tree). This in-memory tree is sometimes called a memtable.
  2. When the memtable gets bigger than some threshold (typically a few megabytes), write it out to disk as an SSTable file. This is efficient because the tree already maintains the key-value pairs sorted by key. The new SSTable becomes the most recent segment, and writes continue to a new memtable instance.
  3. To serve a read, first look in the memtable, then in the most recent on-disk segment, then the next-older segment, and so on.
  4. From time to time, run a merging and compaction process in the background to combine segment files and discard overwritten or deleted values.

An obvious problem with this approach is that if the database crashes, the most recent writes (which are in the memtable but not yet written to disk) are lost. To avoid that, one approach is to keep a separate log on disk to which every write is immediately appended. This separate log is not in sorted order, but that doesn't matter, because its only purpose is to restore the memtable after a crash. The corresponding log can be discarded every time the memtable is written out to an SSTable.

Making an LSM-tree out of SSTables

The algorithm described above is used in LevelDB and RocksDB, key-value storage engine libraries that are designed to be embedded into other applications. Among other things, LevelDB can be used in Riak as an alternative to Bitcask as its storage engine.

This indexing structure was originally described under the name Log-Structured Merge-Tree (LSM-tree).

Lucene, an indexing engine for full-text search, uses a similar method for storing its term dictionary. A full-text index is more complex than a key-value index but is based on a similar idea: given a word in a search query, find all the documents that mention the word. It's usually implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list). In Lucene, the mapping from term to postings list is kept in SSTable-like sorted files that are merged in the background as needed.

Right, so LSM-trees are the indexes, but the actual storage on disk is as SSTables? Or do we still need to keep an in-memory index of the memtable?

Performance Optimizations

The LSM-tree algorithm can be slow when looking up keys that do not exist in the database: you first have to check the memtable, then all the segments all the way up to the oldest (possibly having to read from disk for each one) to be certain that the key does not exist. In order to optimize this access, storage engines often make use of Bloom filters.

A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key definitely does not appear in the database, thus saving you from unnecessary disk reads for nonexistent keys.
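A minimal Bloom filter sketch (the size and the way the k hash functions are derived here are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive k hash functions by salting one hash with an index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

Membership tests may produce false positives but never false negatives, which is exactly the property that lets a storage engine skip disk reads for missing keys.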

There are also strategies to determine the order and timing of how SSTables are compacted and merged. The two most common options are size-tiered and leveled compaction. LevelDB and RocksDB use leveled compaction, HBase uses size-tiered, and Cassandra supports both.

Size-Tiered Compaction: Here, newer and smaller SSTables are successively merged into older and larger SSTables.

Leveled Compaction: The key range is split into smaller SSTables and older data is moved into separate "levels". This allows compaction to proceed more incrementally and use less disk space. The levels are structured roughly so that each level is in total 10x as large as the level above it. New keys arrive at the highest layer, and as that level gets larger and larger and hits a threshold, some SSTables at that level get compacted into fewer (but larger) SSTables one level lower.

Within a single level, SSTables are non-overlapping: one SSTable might contain keys covering the range [a,b), the next [b,d), and so on. The key-space does overlap between levels: if you have two levels, the first might have two SSTables (covering the ranges above), but the second level might have a single SSTable over the key space [a,e). Looking for the key aardvark requires looking in two SSTables: the [a,b) SSTable in Level 1, and the [a,e) SSTable in Level 2.
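Because ranges within a level don't overlap, the per-level lookup can be sketched as a single binary search over the SSTables' start keys (the tuple representation here is my own, not any engine's format):

```python
import bisect

def find_sstable(level, key):
    """level: list of (start_key, end_key, table_id) tuples, sorted by
    start_key, with each SSTable covering the range [start_key, end_key)."""
    starts = [start for start, _, _ in level]
    pos = bisect.bisect_right(starts, key) - 1  # last start_key <= key
    if pos >= 0:
        start, end, table_id = level[pos]
        if start <= key < end:
            return table_id
    return None   # the key falls in a gap or past the last range
```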

Basically, a level has many SSTables.

B-Trees

B-trees are a popular indexing structure. Like SSTables, they keep key-value pairs sorted by key, but the similarity ends there.

Log-structured indexes break the database down into variable-size segments, while B-trees break it down into fixed-size blocks or pages. Each page can be identified by its address or location on disk, which allows one page to refer to another. Pages are small, typically 4 KB, compared to segments, which can be several megabytes. Pages are stored on disk.

Right - so are B-trees simply an indexing structure? Like they don’t refer to how the underlying data is stored?

One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here. The page contains several keys and references to child pages. Each child is responsible for a contiguous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.

Branching factor: the number of references to child pages in one page of the B-tree.

To update the value of an existing key in a B-tree, you search for the leaf page containing that key, change the value in that page, and write the page back to disk (any references to that page remain valid). To add a new key, find the page whose range encompasses the new key and add it to that page. If there's no free space on that page, split the page into two half-full pages, and update the parent page to account for the new subdivision of key ranges.
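A lookup over such pages can be sketched like this (insertion with page splits is omitted, and the Page class is a stand-in for a disk page, not any real database's structure):

```python
import bisect

class Page:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # boundary keys (interior) or keys (leaf)
        self.children = children  # interior: len(children) == len(keys) + 1
        self.values = values      # leaf pages only

def btree_get(page, key):
    while page.children is not None:            # descend interior pages
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)      # search within the leaf
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None
```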

Making B-trees reliable

The main write operation of a B-tree is to overwrite a page on disk with new data. The assumption is that an overwrite does not change where a page is located, i.e. all references to a page typically remain intact when the page is overwritten. This differs from LSM-trees, where files are never updated in place but are append-only.

Some operations require several different pages to be overwritten, e.g. when a page is split because an insertion made it overfull: we need to write the two pages that were split and update the parent page with references to both child pages. This operation is dangerous because if the database crashes after only some of the pages have been written, the index can be left corrupted.

A solution used to make databases resilient to crashes is to keep a write-ahead log on disk. It is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. It's used to restore the DB when it comes back from a crash.

There are also concurrency issues associated with updating pages in place. If multiple threads access a B-tree at the same time, a thread may see the tree in an inconsistent state. The solution is usually implemented by protecting the tree's data structures with latches (lightweight locks). This is not an issue with log structured approaches since all the merging happens in the background without interfering with incoming queries.

B-tree optimizations

Different optimizations have been made to B-trees over the years:

  1. Some databases use a copy-on-write scheme instead of a write-ahead log for crash recovery: a modified page is written to a different location, and new versions of the parent pages are created pointing at it.
  2. Keys can be abbreviated to save space, especially in interior pages, which only need enough information to act as boundaries between key ranges. Packing more keys into a page gives a higher branching factor and thus fewer levels.
  3. Many implementations try to lay out leaf pages in sequential order on disk, so that range scans don't require a disk seek for every page.
  4. Additional pointers may be added, e.g. each leaf page referencing its sibling pages, so a range scan doesn't need to jump back to parent pages.
  5. B-tree variants such as fractal trees borrow some log-structured ideas to reduce disk seeks.

So, page-oriented storage engines are organized into fixed-size pages, and these pages are all part of a tree called a B-tree.

SQLite, for example, has a B-tree for every table in the database, as well as a B-tree for every index. For the indexes, the key stored on a page is the indexed column's value, while the value is the rowid where the row can be found. For the table B-tree, the key is the rowid, while I believe the value is all the data in that row: https://jvns.ca/blog/2014/10/02/how-does-sqlite-work-part-2-btrees/

https://hackernoon.com/fundamentals-of-system-design-part-3-8da61773a631

Comparing B-Trees and LSM-Trees

As a rule of thumb, LSM-trees are typically faster for writes, while B-trees are thought to be faster for reads. Reads are slower on LSM-trees because they may have to check several different data structures and SSTables at different stages of compaction.

Don't forget that in the end, these databases are just abstractions over writing to disk.

Advantages of LSM Trees

  1. An LSM-tree can typically sustain higher write throughput than a B-tree, partly because it sometimes has lower write amplification (one write to the database turning into multiple writes to disk over its lifetime), and partly because it writes compact SSTable files sequentially rather than overwriting several pages in a tree.
  2. LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees, which leave some disk space unused due to fragmentation within pages.

On many SSDs, the firmware internally uses a log-structured algorithm to turn random writes into sequential writes on the underlying storage chips, so the impact of the storage engine's write pattern is less pronounced (this mainly concerns point 1 above). Note that lower write amplification and reduced fragmentation are still advantageous on SSDs: representing data more compactly allows more read and write requests within the available I/O bandwidth.

Downsides of LSM Trees

  1. The compaction process can sometimes interfere with the performance of ongoing reads and writes: a request may need to wait while the disk finishes an expensive compaction operation, making response times at high percentiles less predictable.
  2. The disk's finite write bandwidth is shared between the initial write (logging and flushing the memtable) and the background compaction threads. If write throughput is high and compaction is not configured carefully, compaction may not keep up with incoming writes, and unmerged segments pile up until disk space runs out.
  3. A key may exist in several copies in different segments. B-trees, where each key exists in exactly one place, can offer stronger transactional semantics, e.g. locks attached directly to the tree.

Other Indexing Structures

We've mainly covered key-value indexes which are like a primary key index, but we can also have secondary indexes. You can typically create several secondary indexes on the same table in relational databases. A secondary index can be constructed from a key-value index.

With secondary indexes, note that the indexed values are not necessarily unique: several rows (documents, vertices) may exist under the same index entry. This can be solved in two ways:

  1. Make each value in the index a list of matching row identifiers (like a postings list in a full-text index).
  2. Make each key unique by appending a row identifier to it.

Both B-trees and log-structured indexes can be used as secondary indexes.

Storing values within the index

The key in an index is the thing that queries search for, but the value can be either:

  1. The actual row (document, vertex) in question
  2. A reference to the row stored elsewhere. In this case, the rows are stored in what's known as a heap file, which stores data in no particular order (it could be append-only, or it may keep track of deleted rows in order to overwrite them with new data later). This approach is common because it avoids duplicating data when there are several secondary indexes: each index just references a location in the heap file.

Approach 2 - Heap file

The heap file approach can be efficient when updating a value without changing the key, provided the new value is not larger than the old value. If it is larger, the record might need to be moved to a new location in the heap where there is enough space. When this happens, either all indexes need to be updated to point to the new heap location of the record, or a forwarding pointer is left behind in the old heap location.

Approach 1 - Actual row

In some cases, the hop from the index to the heap file is too much of a performance penalty for reads, so the indexed row is stored directly within the index. This is known as a clustered index. In MySQL's InnoDB storage engine, the primary key of a table is always a clustered index, and secondary indexes refer to the primary key (rather than a heap file).

Covering Index

There's a compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index), known as a covering index or index with included columns, which stores some of a table's columns within the index. With this approach, some queries can be answered using the index alone.

Multi-column indexes

We've only dealt with indexes that map a single key to a value so far. That's not enough if we need to query multiple columns of a table (or multiple fields of a document) simultaneously.

The most common type of multi-column index is a concatenated index, which combines several fields into one key by appending one column to another. The index is useless if you want to search by one of the later columns alone, because it sorts by the first column first; the columns should therefore be ordered to match the most common query patterns.
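A small sketch of a concatenated index on (last_name, first_name), using sorted Python tuples to stand in for the index file; a search on the first column becomes a range scan, while a search on first_name alone would have to scan everything:

```python
import bisect

# Concatenated index: key is a tuple of columns, value is a row id.
index = sorted([
    (("Smith", "Alice"), 101),
    (("Smith", "Bob"), 102),
    (("Jones", "Carol"), 103),
])

def find_by_last_name(index, last_name):
    keys = [k for k, _ in index]
    # All keys with this last_name form one contiguous range.
    lo = bisect.bisect_left(keys, (last_name,))
    hi = bisect.bisect_right(keys, (last_name, chr(0x10FFFF)))
    return [row_id for _, row_id in index[lo:hi]]
```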

Another approach is a multi-dimensional index. These are a more general way of querying several columns at once, which is particularly useful for geospatial data. Say you want to search for records within both a longitude range and a latitude range: a standard LSM-tree or B-tree index cannot answer that efficiently. It can give you all the records within a range of latitudes (at any longitude), or within a range of longitudes (at any latitude), but not both simultaneously.

One option is to translate a two-dimensional location into a single number using a space-filling curve, and then use a regular B-tree index.
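One well-known space-filling curve is the Z-order (Morton) curve, which simply interleaves the bits of the two coordinates; this is a generic sketch, not tied to any particular database:

```python
# Interleave the bits of x and y so that points close together in 2-D
# tend to be close together on the resulting 1-D key, which a regular
# B-tree can then index.
def z_order(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return key
```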

Full-text search and fuzzy indexes

The indexes we've discussed so far assume that we have exact data: we know the exact value of a key, or a range of values with a sort order. For searching for similar keys, such as misspelled words, we need fuzzy querying techniques.

Levenshtein automaton: Supports efficient search for words within a given edit distance.

Keep everything in memory

The data structures discussed so far are answers to the limitations of disks, which are awkward to deal with compared to main memory. With both magnetic disks and SSDs, data on disk must be laid out carefully to get good read and write performance. Disks have two significant advantages over main memory though:

  1. They are durable: their contents are not lost if the power is turned off.
  2. They have a lower cost per gigabyte than RAM.

There has been a lot of development of in-memory databases, especially as RAM has become cheaper and many datasets are simply not that big, so keeping them entirely in memory is feasible.

In-memory databases aim for durability by:

  1. Using special hardware such as battery-powered RAM.
  2. Writing a log of changes to disk.
  3. Writing periodic snapshots to disk.
  4. Replicating the in-memory state to other machines.

VoltDB, MemSQL and Oracle TimesTen are in-memory databases with a relational model.

Interestingly, the performance advantage of in-memory databases is not due to the fact that they don't need to read from disk. A disk-based storage engine may never need to read from disk if there's enough memory, because the OS caches recently used data blocks in memory anyway. Rather, they can be faster because there's no overhead of encoding in-memory data structures in a form that can be written to disk.

Besides performance, an interesting area for in-memory databases is that they allow for the use of data models that are difficult to implement with disk-based indexes. E.g. Redis offers a db-like interface to data structures such as priority queues and sets.

Recent research indicates that in-memory database architecture can be extended to support datasets larger than the available memory, without bringing back the overheads of a disk-centric architecture. This approach works by evicting the least recently used data from memory to disk when there's not enough memory, and loading it back into memory when it's accessed again in the future.

Transaction Processing or Analytics?

Transaction: A group of reads and writes that form a logical unit.

OLAP: Online Analytic Processing. Refers to queries generally performed by business analysts that scan over a huge number of records and calculate aggregate statistics.

OLTP: Online Transaction Processing. Interactive queries which typically return a small number of records.

In the past, OLTP-type queries and OLAP-type queries were performed on the same databases. However, there's been a push for OLAP-type queries to be run on data warehouses.

Data Warehousing

A data warehouse is a separate DB that analysts can query without affecting OLTP operations. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases and loaded into the warehouse using an ETL (Extract-Transform-Load) process.

It turns out that the indexing algorithms discussed so far work well for OLTP, but not so much for answering analytics queries.

Transaction processing and data warehousing databases look similar on the surface, but the latter are optimized for analytics queries. Both are often accessible through a common SQL interface though.

A number of SQL-on-Hadoop data warehouses have emerged such as Apache Hive, Spark SQL, Cloudera Impala etc.

Stars and Snowflakes: Schemas for Analytics

Many data warehouses are used in a fairly formulaic style known as a star schema (or dimensional modelling). The name "star schema" comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables (which represent the who, what, where, when, how, and why of each event); the connections to these tables are like the rays of a star.

We also have the snowflake schema, where dimensions are further broken down into subdimensions.

Column Oriented Storage

In most OLTP databases, storage is laid out in a row-oriented fashion: all the values from one row of a table are stored next to each other. Document databases are similar: an entire document is typically stored as one contiguous sequence of bytes.

Analytics queries often access millions of rows, but few columns.

The idea behind column-oriented storage is straightforward: don't store all the values from one row together; instead, store all the values from each column together. If each column is stored in a separate file, a query only needs to read and parse the columns it is interested in, which can save a lot of work.

The column-oriented storage layout relies on each column file containing the rows in the same order.

Column Compression

In addition to loading only the columns required for a query from disk, we can reduce the demands on disk throughput by compressing data, and column-oriented storage lends itself well to compression. Different compression techniques can be used, such as bitmap encoding. The number of distinct values in a column is often small compared to the number of rows, so we can take a column with n distinct values and turn it into n separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.
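Bitmap encoding can be sketched in a few lines (plain Python lists stand in for compressed bitmaps):

```python
# One bitmap per distinct value, with one bit per row.
def bitmap_encode(column):
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps
```

A query like `WHERE country IN ('NG', 'UK')` then becomes a bitwise OR of the two values' bitmaps.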

So I believe that in column-oriented storage, each 'file' for a column is essentially one long sequence of values, with as many entries as there are rows in the table, and every column file lists the rows in the same order.

Column-oriented storage and column families

Cassandra and HBase have a concept of column families, which is not the same as being column-oriented. Within each column family, they store all the columns from a row together, along with a row key, and they do not use column compression, so they are still mostly row-oriented.

Sort Order in Column Storage

It doesn't matter much in which order the rows are stored in a column store. It's easiest to store them in insertion order, but we can choose to impose a sort order.

It wouldn't make sense to sort each column individually, though, because we'd lose track of which values belong to the same row. Rather, we sort an entire row at a time, even though the data is stored column by column. We can choose the columns by which the table should be sorted.
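Sorting an entire row at a time, while keeping the data in separate column files, amounts to computing one row permutation and applying it to every column (a dict of lists stands in for the column files here):

```python
# Sort all columns together by the values of one chosen column.
def sort_columns(columns, sort_by):
    order = sorted(range(len(columns[sort_by])),
                   key=lambda row: columns[sort_by][row])
    return {name: [values[row] for row in order]
            for name, values in columns.items()}
```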

Several different sort orders

C-Store (and the commercial data warehouse Vertica, which adopted the idea) introduced an extension of this: different queries benefit from different sort orders, so why not store the same data sorted in several different ways?

Writing to Column-Oriented Storage

Writes are more difficult with column-oriented storage. An update-in-place approach, like B-trees use, is not possible with compressed columns. To insert a row in the middle of a sorted table, you would likely have to rewrite all the column files.

Fortunately, a good approach for writing has been discussed earlier: LSM-trees. All writes go to an in-memory store first, where they are added to a sorted structure and prepared for writing to disk. It doesn't matter whether the in-memory store is row-oriented or column-oriented.

Further reading:

Oracle internals: https://stackoverflow.com/a/40740893

How indexes work: https://stackoverflow.com/questions/1108/how-does-database-indexing-work

SQL Server indexes: https://sqlity.net/en/2445/b-plus-tree/

How data is stored on disk: https://www.red-gate.com/simple-talk/sql/database-administration/sql-server-storage-internals-101/

Last updated on 10-06-2020

learning-diary ddia distributed-systems
