MIT 6.824: Lecture 2 - RPC and Threads
This course is based on the Go programming language, and this post will introduce some features in Go that make it well suited for building concurrent and distributed applications.
Threads #
Threads are the unit of execution on a processor. When a program is run on your computer, that starts up a process. That process can then be made up of one or more threads which execute different tasks. Some key things to note about threads are:
- All the threads in a process share memory. They all have access to global variables.
- Each thread keeps its own stack, program counter, and registers.
Why use threads? #
- Threads enable concurrency, which is important in distributed systems. Concurrency allows us to schedule multiple tasks on a single processor. These tasks are run in an interleaved manner and essentially share CPU time between themselves. For example: with I/O concurrency, instead of waiting for an I/O operation to complete before continuing execution (thereby rendering the CPU idle), threads allow us to perform other tasks while we wait.
- Parallelism: We can perform multiple tasks in parallel on several cores. Unlike with just concurrency, where only one task is making progress at a time (depending on which has its share of CPU time at that instant), parallelism allows multiple tasks to make progress at the same time since they are executing on different CPU cores.
- Convenience: Threads provide a convenient way to execute short-lived tasks in the background e.g. a master node continuously polling a worker to check if it's alive.
Go has Goroutines, which are lightweight threads for managing concurrency.
What if we can't have multiple threads? #
There's a concept of Event Driven Programming where a process only has a single thread which listens for events and executes user specified functions when the event occurs. This concept is used by Node.js and the thread is known as the event loop.
The key thing here is that although the application appears to run on a single thread from the programmer's perspective, the runtime internally uses multiple threads to handle tasks. The main difference is that the programmer does not have to deal with these internal threads and the challenges of coordination between them. All the programmer has to do is specify callback functions to be executed on the main thread when those background tasks have completed.
When the single thread receives an event (like a button click or a task completion), it pauses its current task, executes the callback function for the event, and then returns to the paused job.
Downsides of Event-Driven Programming #
- You will need additional coordination between processes to gain the benefits of parallelism on a multi-core system. With Node.js, you can fire up child processes to be run on each CPU, but you need to handle coordination between those processes.
- It's harder to implement this pattern (Though this is subjective, of course).
Threading Challenges #
- Deadlocks: These happen when two or more threads are waiting on each other in such a way that none can progress.
- Accessing shared data: What happens if two threads do n = n + 1 at the same time? Or one thread reads a value while another one increments it? This is known as a race condition. Using Go's sync.Mutex to add locks around the shared data is one way to solve this problem. An alternative is to avoid sharing mutable data altogether. Go also has a built-in data race detector (the -race flag).
- Coordination between threads: If one thread is producing data while another is consuming that data, it raises questions like "How can the consumer wait for data to be produced, and release the CPU while waiting?" and "How can the producer then wake up the consumer?"

Go has channels and WaitGroups for coordinating communication between threads.
Remote Procedure Call (RPC) #
RPC is a means of client/server communication between processes on the same machine or different machines. Here, the client executes a procedure (function/method) on a remote service as if it were a local procedure call.
The steps that take place during an RPC are as follows [1]:
- A client invokes a client stub procedure, passing parameters in the usual way. The client stub resides within the client's own address space.
- The client stub marshalls the parameters into a message. Marshalling includes converting the representation of the parameters into a standard format, and copying each parameter into the message.
- The client stub passes the message to the transport layer, which sends it to the remote server machine.
- On the server, the transport layer passes the message to a server stub, which demarshalls the parameters and calls the desired server routine using the regular procedure call mechanism.
- When the server procedure completes, it returns to the server stub (e.g., via a normal procedure call return), which marshalls the return values into a message. The server stub then hands the message to the transport layer.
- The transport layer sends the result message back to the client transport layer, which hands the message back to the client stub.
- The client stub demarshalls the return parameters and execution returns to the caller.
The main benefit of this is that it simplifies the process of writing distributed applications since RPC hides all the network code into stub functions. Programmers don't have to worry about details like data conversion and parsing, and opening and closing a connection.
Note: The client knows what server to talk to through binding. Go has an RPC library to ease this communication between processes. In Go's RPC library, the server name and port are passed as arguments to a method when setting up the connection.
Dealing with failures #
From the perspective of a client, failure means sending a request to the server and not getting a response back within a particular timeout. This can be caused by a number of things, including lost packets, a slow server, a crashed server, or a broken network.
Dealing with this is tricky because the client would not know the actual status of its request. Possible scenarios are:
- The server never saw the request.
- The server executed the request but crashed just before sending a reply.
- The server executed a request and sent the reply, but the network died before delivering the reply.
The simplest way to deal with a failure would be to just retransmit the request. However, if the server had already executed the request, resending it could mean the server executes the same request twice, which could lead to unwanted side effects. This failure-handling method works well for idempotent requests, i.e. operations that have the same effect whether executed once or multiple times. Many operations are not idempotent, and so we need a more general approach to handle failures.
RPC Semantics #
An RPC implementation can use any of the following semantics for making requests:
- At-Most-Once: At-most-once semantics ensure that a request will not be retried automatically by the client; resending a request is opt-in for the client. Therefore, without an explicit retry mechanism for a failed request, a request may be lost and never executed. If the request is retried, the server is responsible for detecting duplicate requests and ensuring that only one succeeds.
- At-Least-Once: Here, a request may be executed one or more times and will not be lost. The client will keep retrying the request until it receives a positive acknowledgement that the request has been executed. This is appropriate for requests with no side effects (like read-only requests) and idempotent operations.
- Exactly-Once: In this mode, requests can neither be duplicated nor lost. This is the hardest to achieve because it requires both that a response be received from the server and that there be no duplicate executions. If we have multiple servers and the one handling the initial request crashes, the other servers may not be able to tell whether the request was executed by the initial server, and agreeing on a decision becomes a challenge.
Go RPC #
Go RPC guarantees at-most-once semantics. If it doesn't get a reply, it will just return an error. The client can opt to retry a failed request, but it is then up to the server to detect duplicates to maintain the at-most-once guarantee: after all, the request may have actually executed while only the reply got lost.
Some complexities related to at-most-once communication between processes are:
- How do we guarantee that the ID of a request is unique between multiple clients? One way to do this is by generating a request ID which combines the unique client ID with a sequence number.
- For detecting duplicates, how long should each request ID be kept for? We cannot keep all the request IDs indefinitely, so they have to get discarded at a point. A method for handling this could be for each client to include an extra identifier with each request. Let's call it X. The extra identifier will tell the server that it is safe to delete all request IDs that came before X.
- How do we handle duplicate requests while the original request is still executing? We could have a "pending" flag next to each executing RPC and wait for it to complete, or simply ignore the new request.
- What if the server crashes and restarts with the duplicate info being kept in memory? The server could write duplicate info to disk. The server could also replicate information about duplicates across multiple machines.
[1] Remote Procedure Call (RPC) - Lecture notes from Worcester Polytechnic Institute
Further Reading #
- Lecture 2: Infrastructure - RPC and Thread - MIT 6.824 Lecture Notes.
- Remote Procedure Calls by Paul Krzyzanowski