Several complications intensify when you adopt a microservices architecture, and one that can wreak havoc on your application is the race condition. Find out what race conditions are and how to avoid them.

Abstract

Race conditions impact all applications that work with some level of parallelism. However, in a distributed system, the complexity grows because we have more concerns than in centralized applications. For use cases that rely on consistency, a distributed system must synchronize its state across its own components and its dependencies. In some applications, implementing a distributed critical section might be necessary to have transactional operations (for example, when using NoSQL databases with no native support for transactions, or when keeping data replicated across different microservices updated). In this context, this paper will present a design pattern that shows how to implement a distributed critical section using external data storage to acquire and release locks, making it possible to achieve mutual exclusion in a distributed context.

Keywords: Race Condition. Microservices. Distributed Systems. ACID. Consistency.

Introduction

"Microservices architecture" describes a way to design software applications as independently deployable services, each one encapsulating a specific subset of the whole system's requirements. However, microservices may still communicate with each other, which exposes them to all the kinds of problems that distributed systems are subject to [3].

One of these problems is the innate tradeoff between availability and consistency, as dictated by the CAP theorem [4][5]. The scenario becomes even more complicated when you have to keep data synchronized across different services, which may lead to race conditions and other classic computing problems [2].

This article was created to help the reader understand how classic computing problems, such as race conditions, may impact the design and development of a microservice application [1][2]. Also, we'll introduce the concept of a distributed critical section, which solves the problem of race conditions in distributed systems. In addition to this, this paper will present:

  1. Class diagrams with a proposed design pattern to implement a distributed critical section;
  2. Code snippets with basic implementations of the proposed solution.

What is not in the scope of this article:

  1. Distributed algorithms for synchronization, like Lamport timestamps or consensus algorithms for blockchains;
  2. In-depth analysis of how some particular system works, like MongoDB, Cassandra, etc;
  3. To keep things simple (if possible), we won't discuss deadlocks. However, we'll discuss starvation, which is a related topic.

In the next section, we'll explain how race conditions may impact the development of distributed systems, and we'll also introduce one specific scenario to work on. In the sections titled "A Design Pattern" and "Code Speaks Louder Than Words," the reader will find abstract solutions to these problems, also featuring a class diagram and some code implementations. Finally, at the end of the text, you will find a critical note on this work.

What are Race Conditions?

Even though the section title refers to race conditions only, be aware that we're going to review several concepts that will aid you in understanding the problem and solution proposed by this article.

In 1965, Edsger Dijkstra published a single-page paper in which he described and solved a problem that would occupy computer science for decades to come. The premise is simple:

Given N computer processes running simultaneously, if two or more of these processes access a resource at the same time, we may incur a consistency problem. The reason is that the resource is shared across the N processes, such that after multiple accesses, we can't determine the resource's state appropriately. This problem is known as a race condition [2].
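The problem is easy to reproduce on a single machine. In the sketch below (our illustration, not from Dijkstra's paper), two threads increment a plain counter without any synchronization. Because `value++` is a read-modify-write sequence rather than a single atomic step, one thread's update can overwrite the other's, and the final count typically falls short of the expected total:

```java
class RacyCounter {

    private int value = 0;

    void increment() {
        value++; // read-modify-write: not atomic, so concurrent calls may lose updates
    }

    int value() {
        return value;
    }

    public static void main(String[] args) throws InterruptedException {
        RacyCounter counter = new RacyCounter();
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter.increment();
            }
        };
        Thread first = new Thread(task);
        Thread second = new Thread(task);
        first.start();
        second.start();
        first.join();
        second.join();
        // Expected 200000, but the race usually makes the result smaller
        System.out.println(counter.value());
    }
}
```

The exact final value is nondeterministic, which is precisely the consistency problem: after multiple accesses, we can't determine the resource's state.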

The solution, as proposed by Dijkstra, was a simple algorithm known as mutual exclusion. It is described as follows:

  1. while true
  2.     acquire lock
  3.     write resource
  4.     release lock
  5. end-while

Line 3 of the snippet above ("write resource") delimits a code region called the critical section, which execution threads may only enter one at a time. Essentially, the process that acquires the lock first will execute, while the others will have to wait until the lock is released [2].

Although synchronization solves the race condition, if a lock isn't released after a process leaves the critical section, the other waiting processes will be blocked indefinitely. If a computer process had the same needs as a human, it would starve to death [2]. This sounds a little rough, but it's quite accurate if we translate the analogy to real computational resources.
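In Java, Dijkstra's algorithm maps directly onto `java.util.concurrent.locks.ReentrantLock`. The sketch below is an illustration of the classic (single-process) case, not part of the pattern proposed later. Note the `try`/`finally`: releasing the lock in `finally` is what protects the waiting threads from starvation, because the lock is released even if the critical section throws.

```java
import java.util.concurrent.locks.ReentrantLock;

class SynchronizedCounter {

    private final ReentrantLock lock = new ReentrantLock();
    private int value = 0;

    void increment() {
        lock.lock();       // acquire lock
        try {
            value++;       // critical section: write resource
        } finally {
            lock.unlock(); // release lock, even if the write throws
        }
    }

    int value() {
        return value;
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedCounter counter = new SynchronizedCounter();
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter.increment();
            }
        };
        Thread first = new Thread(task);
        Thread second = new Thread(task);
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println(counter.value()); // always 200000: no increment is lost
    }
}
```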

Hence, we determine two problems in concurrent programming:

  1. Race condition: two processes accessing the same resource at the same time may lead to inconsistencies;
  2. Starvation: unreleased locks will make the critical section inaccessible by other processes, which will be blocked forever unless someone takes external actions.

But how do these problems impact distributed systems such as scalable microservices? It's simple: if a set of microservices shares data at some level, keeping this data synchronized might lead to race conditions. That's where the concept of a transaction comes in, which we'll explore further in the next section [1][6].

The Distributed Systems Paradigms

A transaction represents a single unit of logic work, containing one or multiple operations. It encapsulates commands that change the resource's state with the following properties [1]:

  1. Atomicity: all operations are indivisible and irreducible; either all of them occur or none occur;
  2. Consistency: the resources must be led from a valid state to another valid state;
  3. Isolation: concurrent transactions produce the same results as if they were executed sequentially;
  4. Durability: committed transactions remain committed even in case of failures or power outage.

If you develop an application whose operations are all ATOMIC, CONSISTENT, ISOLATED, and DURABLE, hence ACID, you will have an application whose state is synchronized with any external dependency [1][5].

Please note that this confronts availability directly: if your operations are all ACID and an external dependency is out of service, you need to revert (a.k.a. roll back) your operations. This happens because you can't keep the states of your application and its dependencies synchronized. Therefore, your system's availability will be impacted by any partition in the network [4].

If the problem above bothers you deeply, know that you're not alone: there is an alternative to the ACID model and, ironically enough, it's called BASE. If your system is BASICALLY AVAILABLE and allows SOFT STATE, meaning that it exports operations that will EVENTUALLY lead your data to a CONSISTENT state, hence BASE, you can afford to keep your system available all the time. This is true even if one or more dependencies are out of service. This model is massively implemented on NoSQL databases, such as Cassandra [5][8].

However, these models must be chosen according to the system requirements. If your application demands consistency over availability, make all of your operations as ACID as you can. On the other hand, if availability is what you need, you can model synchronization-related tasks to be asynchronous and independent of the external services' statuses [6].

Please build your systems in a way that meets the requirements. Do not consider emerging technologies as silver bullets just because they are nice, enchanting, or fast. When you do this, you risk building solutions that will eventually fail.

A Specific Scenario

Let's suppose you built a microservice application that's also horizontally scalable. Aside from the data you maintain internally, you also have to update a dependency with some specific piece of information. This means that you store a replica of this dependency's contract internally.

Note that you can't allow two or more instances of your application to update the dependency at the same time! The reason is race conditions. However, you only need to synchronize updates per entity, meaning that two different entities being updated at the same time won't cause a conflict.

In this context, we need to implement a distributed critical section. This means that the N processes described by Dijkstra now run on different computers [2]! Also, they need to be synchronized by state, meaning that our critical section's lock increased a little in complexity.

The pattern we want to propose is similar to the Saga design pattern [1]. However, we want to restrict it to the following conditions:

  1. Two transactions with the same state can't happen at the same time (avoid race condition);
  2. The process trying to access a resource will eventually acquire it (avoid starvation).

The reader can understand state as the piece of data that uniquely locks the critical section. In our example above, since all updates to the external dependency need to be synchronized by entity, the entity's unique identifier is the most obvious candidate.

Please note that we provide a design pattern that avoids race conditions and helps you avoid starvation, but that doesn't necessarily mean your operations will be ACID. Rather, it means you will have fewer concerns when implementing your ACID operations yourself, assuming they are concurrent at some level.

To make your process follow the ACID model accurately, make sure your distributed critical section only runs ACID operations. We recommend that you apply the Saga design pattern in this case [1].

A Design Pattern

The key to making a distributed critical section is making your application's instances agree that a thread should be blocked for a given state. Let's suppose that instances i and j receive requests to update an entity x at the same time. Then, i and j must agree that only one of them will update x at a time. In this article, we'll use an external consistent data repository that i and j can access and agree upon x's state.

Consider the enhanced mutual exclusion algorithm below:

  1. do 
  2.     try lock x
  3. while x is not locked
  4. write resource
  5. release lock x

Please notice that the operations in lines 2 and 5 depend on the external data repository, which must be consistent for all of your application's instances. In this case, the critical section in line 4 will now be synchronized for any process of your application.
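Before presenting the full pattern, the "try lock" step can be sketched in miniature. In the snippet below (the class name is ours, not from any library), a `ConcurrentHashMap` plays the role of the external data repository: `putIfAbsent` is the atomic "insert the state only if nobody holds it" operation. Keep in mind this is an in-process stand-in for illustration only; a real deployment must use external storage visible to all of the application's instances, which an in-memory map is not.

```java
import java.util.concurrent.ConcurrentHashMap;

class StateLockRegistry<T> {

    // Stand-in for the external consistent repository; see the caveat above.
    private final ConcurrentHashMap<T, Boolean> lockedStates = new ConcurrentHashMap<>();

    /** Atomically tries to lock a state; returns true only for the first caller. */
    boolean tryLock(T state) {
        return lockedStates.putIfAbsent(state, Boolean.TRUE) == null;
    }

    /** Releases a previously locked state so other callers can acquire it. */
    void release(T state) {
        lockedStates.remove(state);
    }
}
```

With this registry, the algorithm above becomes: spin on `while (!registry.tryLock(x))`, write the resource, then call `registry.release(x)` in a `finally` block.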

However, how do we implement this solution with a proper design pattern? We will present the components, their responsibilities, and their relationships in order to implement it. Please check the class diagram in Figure 1 for more details.

Figure 1 - A relationship diagram that illustrates a design pattern to implement critical sections in a distributed context. The purpose of this pattern is to avoid multiple executions of concurrentOperation for the same state. T is a generic class.

From the diagram in Figure 1, we can see that each interface is designed for a purpose, and by combining their operations, we can properly implement a distributed critical section. The components' roles are, respectively:

  1. TransactionService: asks for LockStateService to lock the received state. When the lock is acquired, it will execute the concurrentOperation. Afterwards, this component will ask LockStateService to release the state;
  2. LockStateService: given a received state, this component attempts to persist it in the StateRepository. The StateRepository may report that the state is already in use by someone else, so it's LockStateService's responsibility to retry until the state is released. Once acquired, it returns the locked state to the caller. Another exported operation then receives the locked state and releases it by calling StateRepository;
  3. StateRepository: atomically inserts a state into a shared storage system (database, Web server, etc.). It accepts only the first insertion of a given state, meaning that subsequent ones will fail with a corresponding error. It also exports an operation to remove an existing state from the system;
  4. Lock: a logic representation of a locked state.

For a basic implementation, as well as details on which behaviours are expected to change when extending these base components, the reader can refer to the next section.

Code Speaks Louder Than Words 

We developed a Java code example for the proposed solution. You can view it directly from our source code repository or check out the listings below [7]. For each listing, the reader will see a short description of the class and which behavior is expected to change when it is extended.

class Lock<T> {

    private final T state;
    // Implement any other lock metadata you wish

    public Lock(T state) {
        this.state = state;
    }

    public T state() {
        return state;
    }
}


Listing 1 - The Lock class. Logic representation of a locked state. You can add some metadata to the lock if needed.

import java.util.function.Consumer;

/**
 * A class that implements a transaction such that no race condition is
 * expected for concurrent operations. Two operations are said to be
 * concurrent whenever their states are equal.
 *
 * @param <T> The state which will synchronize your transactions.
 */
class TransactionService<T> {

    private final LockStateService<T> lockStateService;

    public TransactionService(LockStateService<T> lockStateService) {
        this.lockStateService = lockStateService;
    }

    public void doInTransaction(T state, Consumer<T> concurrentOperation) {
        try {
            Lock<T> acquiredLock = lockStateService.acquire(state);

            doInTransactionWhenLockIsAcquired(acquiredLock, concurrentOperation);
        } catch (Exception exception) {
            /*
             * Any exception during commit or rollback flows? Maybe you'd like
             * to catch them here.
             */
        }
    }

    /**
     * Once the lock is acquired, the following code runs only in this process,
     * while other processes whose lock state equals `acquiredLock.state()`
     * have to wait.
     *
     * Hence, this method follows the ACID guidelines as long as
     * `concurrentOperation` also follows them. What we do guarantee is that no
     * race conditions will occur and that every acquired lock will be released.
     */
    protected void doInTransactionWhenLockIsAcquired(Lock<T> acquiredLock, Consumer<T> concurrentOperation) {
        try {
            concurrentOperation.accept(acquiredLock.state());
            commit(acquiredLock);
        } catch (Exception exception) {
            rollback(acquiredLock, exception);
        } finally {
            lockStateService.release(acquiredLock);
        }
    }

    /**
     * Implement your commit operations.
     */
    protected void commit(Lock<T> acquiredLock) {
    }

    /**
     * Implement your rollback operations.
     */
    protected void rollback(Lock<T> acquiredLock, Exception exception) {
    }
}


Listing 2 - The TransactionService class. This is the implementation of the mutual exclusion algorithm, extended to aggregate components that also support a distributed context. To change how the mutual exclusion happens, this class is the one you should extend.

import java.util.Optional;

/**
 * A class responsible for locking states. This is where you can implement
 * a proxy between your transaction and its resource acquisition!
 *
 * @param <T> The state's class.
 */
class LockStateService<T> {

    private final StateRepository<T> stateRepository;

    public LockStateService(StateRepository<T> stateRepository) {
        this.stateRepository = stateRepository;
    }

    public Lock<T> acquire(T state) {
        /*
         * Create a shared state. This state is said to be shared because its
         * value must be the same across any instances connecting to the same
         * data repository.
         */
        Optional<T> sharedState = Optional.empty();

        /*
         * This section is responsible for retrying the lock acquisition. Here,
         * the programmer can implement whichever policy better fits their
         * requirements. As implemented, this section retries indefinitely and
         * as fast as possible.
         *
         * You can do it differently: maybe implement a limited number of
         * retries, a timed executor, etc.
         */
        while (!sharedState.isPresent()) {
            sharedState = Optional.ofNullable(attemptLockAcquisition(state));
        }

        /*
         * If the process could lock the state, return it to the transaction
         * scope. Otherwise, throw an exception (even though this exception
         * will never be thrown by the retry loop implemented above!)
         */
        return sharedState.map(Lock::new).orElseThrow(LockAcquisitionException::new);
    }

    protected T attemptLockAcquisition(T state) {
        try {
            return stateRepository.create(state);
        } catch (LockIsInUseException lockIsInUseException) {
            return null;
        }
    }

    public void release(Lock<T> lockedState) {
        stateRepository.remove(lockedState.state());
    }
}


Listing 3 - The LockStateService class. This is a proxy between the mutual exclusion algorithm and the state's repository. In order to change the dynamics in the lock acquisition, for example, by adding a timed executor or a limited amount of retries, this class is the one you should extend.
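As a sketch of the limited-retries extension suggested above, the acquisition attempt can be wrapped in a small generic helper. The `RetryPolicy` name and signature below are our invention, not part of the listings: it retries a nullable attempt a bounded number of times, sleeping between tries, and returns null once every attempt has failed, at which point `acquire` could let `orElseThrow(LockAcquisitionException::new)` fire instead of spinning forever.

```java
import java.util.function.Supplier;

class RetryPolicy {

    /**
     * Runs `attempt` up to maxAttempts times, sleeping delayMillis between
     * tries. Returns the first non-null result, or null if all attempts fail.
     */
    static <T> T retry(Supplier<T> attempt, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            T result = attempt.get();
            if (result != null) {
                return result;
            }
            if (i < maxAttempts - 1) {
                Thread.sleep(delayMillis);
            }
        }
        return null;
    }
}
```

Inside an `acquire` override, the infinite loop could then become something like `RetryPolicy.retry(() -> attemptLockAcquisition(state), 5, 100L)`.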

/**
 * An interface for a generic state repository. The operations described here
 * MUST be ATOMIC, CONSISTENT, ISOLATED, and DURABLE, in the terms of
 * distributed systems.
 *
 * Good candidates are:
 *
 * - Single-instance Web servers with synchronous REST APIs enabled;
 * - Any database system that supports transactions (at least at the
 *   insertion level).
 *
 * @param <T> A class whose objects' states should be considered candidates
 *            for race conditions.
 */
interface StateRepository<T> {

    /**
     * Atomic operation to create a single, distributed state object in the
     * data repository. If a process tries to insert a repeated state, an
     * exception should be thrown.
     *
     * @param state The state to be persisted atomically in the data repository.
     * @see LockIsInUseException An example class to be caught when the state
     *      can't be locked.
     */
    T create(T state);

    /**
     * Operation to remove a single, distributed state object from the data
     * repository.
     *
     * This operation has no critical need to be atomic, as long as consumers
     * can accept some threads waiting longer to finish. This can happen on
     * data repositories that support soft state, like MongoDB.
     *
     * @param state The state to be removed from the data repository.
     */
    void remove(T state);
}


Listing 4 - The StateRepository interface. This is the component that will persist the state and share it across the distributed system. If you're willing to change the persistence unit, database, etc., this interface is the one you should implement.

Listings 1, 2, 3, and 4 implement a basic version of the design pattern with no concerns as to which data repository we're using to synchronize the transactions' states. There are also some additional classes, like exceptions, that can support the reader in any future work. For more details, please refer to our source code repository [7].
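To see the pieces working together, the listings can be condensed into a single-process toy (the class name and the map-backed repository below are ours, for demonstration only): `ConcurrentHashMap.putIfAbsent` stands in for the atomic `StateRepository.create`, the spin loop is `LockStateService.acquire`, and the `finally` block is the release that prevents starvation. A real deployment must replace the map with external storage shared by all instances.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** A condensed, single-process illustration of Listings 1-4. */
class InMemoryCriticalSection<T> {

    // Plays the role of StateRepository; real code needs shared external storage.
    private final ConcurrentHashMap<T, Boolean> repository = new ConcurrentHashMap<>();

    public void doInTransaction(T state, Consumer<T> concurrentOperation) {
        // "try lock x / while x is not locked": spin until the state is inserted
        while (repository.putIfAbsent(state, Boolean.TRUE) != null) {
            Thread.onSpinWait();
        }
        try {
            concurrentOperation.accept(state); // critical section
        } finally {
            repository.remove(state);          // release, even on failure
        }
    }

    public static void main(String[] args) throws InterruptedException {
        InMemoryCriticalSection<String> section = new InMemoryCriticalSection<>();
        AtomicInteger updates = new AtomicInteger();
        Runnable worker = () -> {
            for (int i = 0; i < 1_000; i++) {
                section.doInTransaction("entity-42", state -> updates.incrementAndGet());
            }
        };
        Thread first = new Thread(worker);
        Thread second = new Thread(worker);
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println(updates.get()); // 2000: all updates for the state serialized
    }
}
```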

Conclusion

There are several use cases for a distributed critical section. Although the term scares many graduates of computer science programs, we see that there are no mysteries in its implementation. However, some readers might notice a small gap in the text.

The given example features a replicated document in two different services, which is a concerning design option when it comes to microservices. First: it breaks encapsulation. Second: it makes us lose performance by synchronizing threads across multiple instances. However, there are some use cases where this feature is necessary. The authors worked on a project where not only one, but two distributed critical sections were created. This means we implemented a chain of transactions, which guaranteed ACID properties when updating replicated documents in two different services.

There are some other situations where this solution might come in handy. For instance, suppose you're using a NoSQL database that does not support transactions. Now, suppose that you have an endpoint in your application that updates a document and all of its relationships at once. In case another request to update this dataset arrives at the same time, we can use a distributed critical section to synchronize them. This will work in the same way as the transactions provided by SQL databases, but at the application level instead (though this makes me wonder if you really want to use a NoSQL database, after all…).

However, one downside of the proposed solution is the dependency on external data storage, adding one point of failure to our application. Furthermore (and maybe this is a challenge for the future), we can use the StateRepository to implement a consensus algorithm. In this context, we'll go from having an extra point of failure to breaking a microservice's instance encapsulation. To implement consensus, the service's nodes will have to know at least some level of information about each other's states. 

In summary, if we may leave some final recommendations for you, our reader, we would say this: pay attention to your system's requirements. There are more problems and scenarios to deal with when designing distributed systems than when working with centralized applications [6]. The possibilities may encourage you to seek newer solutions, but sometimes they just don't fit the challenges you'll face. Whenever you're confronted with hard decisions, remember the CAP theorem and the cost you will have when implementing a distributed critical section (just in case).

 

This article was co-written by Nilson Aguiar.

 

References

[1] MICROSOFT. Saga distributed transactions. Microsoft Documentations, Jul. 21st 2020. Available on: <https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/saga/saga>. Last accessed: Jan 18th 2021. 

[2] E. W. Dijkstra. Solution of a Problem in Concurrent Programming Control. Communications of the ACM, v.8, n.9, Sep. 1965. Available on: <https://www.di.ens.fr/~pouzet/cours/systeme/bib/dijkstra.pdf>. Last accessed: Jan 18th 2021.

[3] M. Fowler. Microservices. MartinFowler.com, Mar. 2015. Available on: <https://martinfowler.com/articles/microservices.html>. Last accessed: Jan 18th 2021.

[4] IBM. CAP Theorem. IBM Cloud Education, Nov. 2019. Available on: <https://www.ibm.com/cloud/learn/cap-theorem>. Last accessed: Jan 18th 2021.

[5] C. Roe. ACID vs. BASE: The Shifting pH of Database Transaction Processing. Mar. 2012. Available on: <https://www.dataversity.net/acid-vs-base-the-shifting-ph-of-database-transaction-processing/>. Last accessed: Jan 18th 2021.

[6] J. Gabrielson. Challenges with Distributed Systems. Amazon Web Services, INC., 2019 . Available on: <https://aws.amazon.com/builders-library/challenges-with-distributed-systems/>. Last accessed: Jan 18th 2021.

[7] N. Aguiar and R. Novaes. Java Concurrency Distributed Systems. GitLab, Jan. 2021. Available on: <https://gitlab.com/avenuecode/java-concurrency-distributed-systems>. Last accessed: Jan 18th 2021.

[8] B. Katwal. What is the CAP Theorem? MongoDB vs Cassandra vs RDBMS, where do they stand in the CAP Theorem? Medium.com, Sep. 28th, 2019. Available on: <https://medium.com/@bikas.katwal10/mongodb-vs-cassandra-vs-rdbms-where-do-they-stand-in-the-cap-theorem-1bae779a7a15>. Last accessed: Jan 21st 2021.


Author

Rodrigo Novaes

Rodrigo Novaes is a Java engineer at Avenue Code, which might make you wonder why he's writing snippets about C++'s features. He's also postponing his master's for two years now (maybe he'll get it someday). Apart from that, he's really fond of computer science and science as a whole, which is why he's constantly deep into academic things involving programming and mathematics. In his free time, he plays The Legend of Zelda and reads One Piece (which you should also do!).

