Timilearning

Chapter 4 - Encoding and Evolution

These are my notes from the fourth chapter of Martin Kleppmann's: Designing Data Intensive Applications.

Table of Contents


We should aim to build systems that make it easy to adapt to change: Evolvability.

Rolling upgrade: Deploying a new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes.

With rolling upgrades, new and old versions of the code, and old and new data formats may potentially all coexist in the system at the same time. For a system to run smoothly, compatibility needs to be in both directions.

Backward compatibility: Newer code can read data that was written by older code. Simpler to achieve.

Forward compatibility: Older code can read data written by newer code. This is trickier because it requires older code to ignore additions made by a newer version of the code.

Formats for Encoding Data

Programs work with data that have at least 2 different representations:

We need some kind of translation between the two representations:

Encoding/Serialization/Marshalling - Translation from in-memory representation to a byte sequence.

Decoding/ Deserialization - Byte sequence to in-memory representation.

There are a number of different libraries and encoding formats to choose from which we'll discuss next.

Language-Specific Formats

Different programming languages have their built-in support for encoding in-memory objects into byte sequences. Java has Serializable, Ruby has Marshal and so on. However, these language-specific encodings have their own problems:

JSON, XML, and Binary Variants

JSON and XML are the obvious contenders for standard encodings. CSV is another option. These formats are widely known, widely supported, and almost as widely disliked.

XML is often criticized for its verbose syntax. Apart from superficial syntactic issues, they also have subtle problems:

Binary Encoding

The choice of data format can have a big impact especially when the dataset is in the order of terabytes.

JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This has led to a number of binary encodings for JSON (BSON, BJSON etc.) and XML (WBXML etc.). BSON is used as the primary data representation in MongoDB for example.

Thrift and Protocol Buffers: These are binary encoding libraries. Protocol Buffers was developed at Google, while Thrift was developed at Facebook.

Avro: Another binary encoding format different from the two above. This started out as a sub project of Hadoop.

These encoding libraries have some interesting encoding rules which I skipped: http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Modes of Dataflow

Recall that it was stated earlier that to send data from one process to another with which you don't share memory, it needs to be encoded as a sequence of bytes.

We also said that forward and backward compatibility are important.

Here, we'll explore how data flows between processes:

Dataflow Through Databases

The process that writes to a database encodes the data, while the process the reads from it decodes it. It could be the same process doing both, or different processes.

Forward compatibility is required in databases: If different processes are accessing the database, and one of the processes is from a newer version of the application ( say during a rolling upgrade), the newer code might write a value to the database. Forward compatibility is the ability of the process running the old code to be able to read the data written by the new code.

We also need backward compatibility so that code from a newer version of the app can read data written by an older version.

Data outlives code and often times, there's a new to migrate data to a new schema. Avro has sophisticated schema evolution rules that can allow a database to appear as if was encoded with a single schema, even though the underlying storage may contain records encoded with previous schema versions.

Dataflow Through Services: REST and RPC

When there's communication between processes over a network, a common arrangement is to have two roles: clients (e.g. web browser) and servers.

The server typically exposes an API over the network for the client to make requests. This API is known as a service.

A server can also be a client to another service. E.g. a web app server is usually a client to a database.

A difference between a web app service and a database service is that there's usually tighter restriction on the former.

Service-oriented architecture (SOA): Decomposing a large application into smaller components by area of functionality.

Web Services: If a service is communicated with using HTTP as the underlying protocol, it is called a web service. Two approaches to web services are REST and SOAP (Simple Object Access Protocol).

RPC: The RPC model tries to make a request to a remote network look the same as calling a function or method, within the same process ( location transparency - In computer networks, location transparency is the use of names to identify network resources, rather than their actual location).

There are certain problems with this approach though, which can be summarized under the fundamental fact that network calls are different from function calls. E.g.:

Despite these problems, RPC isn't going away. The new generation of RPC frameworks are explicit about the difference between a remote request and a local function call such as Finagle, Rest.li and GRPC.

The main focus of RPC frameworks is on requests between services owned by the same organization, typically within the same datacenter.

Q - How exactly do RPCs differ from REST? Is it just the way the endpoints look?

Message-Passing Dataflow

Asynchronous message-passing systems are somewhere between RPC and databases.

Advantages of a message broker

The communication pattern here is usually asynchronous - the sender does not wait for the message to be delivered, but simply sends it and forgets about it.

Message brokers

The configuration settings of message brokers typically vary, but in general they're used like:

A topic can have many producers and many consumers.

Distributed actor frameworks

The actor model is a programming model for concurrency in a single process. Each part of the system is represented as an actor. An actor is usually a client or an entity which communicates with other actors by sending and receiving asynchronous messages.

In distributed actor frameworks, this model is especially useful for scaling an application across multiple nodes as the same message-passing mechanism is used, regardless of whether the sender and recipient are on the same or different nodes.

This framework integrates the actor programming model and the message broker into a single framework. 3 popular distributed actor frameworks are:

learning-diary ddia distributed-systems

To get notified when I write something new, you can subscribe to the RSS feed.

A small favour

Did you find anything I wrote confusing, outdated, or incorrect? Or do you have an answer to a question I posed here? Please let me know! Just write a few words below and I'll be sure to amend this post with your suggestions.

← Home