Microservices — Transaction Management Patterns

Nil Seri
7 min read · Apr 24, 2022

Some Popular Microservices Transaction Management Patterns and Distributed System Problems


Write-ahead logging (WAL): provides atomicity and durability (the A and D in ACID) in database systems. Changes are first recorded in the log, which must be flushed to stable storage, before they are applied to the database files.

  • Undo log: allows a DBMS to undo a transaction. A version of the modified data is saved before the transaction changes it. The generation of undo logs is accompanied by the generation of redo logs.
  • Redo log: allows a DBMS to redo a transaction. It consists of files that store all changes made to the database as they occur. Every Oracle Database instance has an associated redo log to protect the database in case of an instance failure.
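
To make the ordering concrete, here is a minimal sketch of the write-ahead discipline (a hypothetical SimpleWal class, not a real storage engine): the undo/redo record is appended and flushed to the log before the data itself is touched.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of the WAL discipline: log first, then apply.
public class SimpleWal {
    private final FileOutputStream log;
    private final Map<String, String> table = new HashMap<>(); // stand-in for data pages

    public SimpleWal(String logPath) throws IOException {
        this.log = new FileOutputStream(logPath, true); // append-only log file
    }

    public void put(String key, String newValue) throws IOException {
        String oldValue = table.get(key);
        // 1. Record both the undo image (old value) and the redo image (new value).
        String record = "UPDATE key=" + key + " undo=" + oldValue + " redo=" + newValue + "\n";
        log.write(record.getBytes(StandardCharsets.UTF_8));
        // 2. Force the log to stable storage BEFORE touching the data.
        log.getFD().sync();
        // 3. Only now apply the change to the "database".
        table.put(key, newValue);
    }
}
```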

Two Phase Commit (2PC) ❌

Assumptions:
- One process is the “coordinator” (C); the rest are “cohorts” (H).
- The coordinator may or may not be the initiator.
- Storage is stable and machines do not crash.
- A write-ahead log exists at each site.

Phase 1 (Voting Phase):
- C sends a “query” msg to all H’s and waits for replies.
- Each H executes the trxn up to the commit point, writing an entry to its undo log and an entry to its redo log. Each H replies “agree” or “abort” depending on its local outcome.

Phase 2 (Completion Phase):
Success:
- C sends a “commit” msg to all H’s.
- H’s complete the operation and release all locks and resources held during the transaction.
- H’s send an “acknowledgement” msg to C. C completes the trxn once all “acknowledgement” msgs are received.
Failure:
- C sends a “rollback” msg to all H’s.
- H’s undo the trxn using their undo logs and release all locks and resources held during the transaction.
- H’s send an “acknowledgement” msg to C. C undoes the trxn once all “acknowledgement” msgs are received.
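
Before turning to its drawbacks, here is a toy coordinator loop that makes the two phases explicit (a minimal sketch with a hypothetical Cohort interface; real implementations such as XA transaction managers are far more involved):

```java
import java.util.List;

// Hypothetical cohort contract for the sketch below.
interface Cohort {
    boolean prepare();   // Phase 1: do the work, write undo/redo logs, vote "agree" (true) or "abort" (false)
    void commit();       // Phase 2 (success): make the changes permanent, release locks
    void rollback();     // Phase 2 (failure): undo using the undo log, release locks
}

class TwoPhaseCommitCoordinator {
    // Returns true if the global transaction committed.
    boolean execute(List<Cohort> cohorts) {
        // Phase 1 (voting): blocks until the slowest cohort answers.
        boolean allAgreed = true;
        for (Cohort c : cohorts) {
            if (!c.prepare()) {
                allAgreed = false;
                break;
            }
        }
        // Phase 2 (completion): commit everywhere or roll back everywhere.
        for (Cohort c : cohorts) {
            if (allAgreed) c.commit(); else c.rollback();
        }
        return allAgreed;
    }
}
```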

- not appropriate for microservices ❌
- if C fails, it is a single point of failure; availability is impacted.
- overall latency depends on the slowest participant, and locks are held everywhere until completion.
- not supported by most NoSQL dbs & msg brokers

Saga Pattern ✔️

  • Each local transaction is a single, independent unit of work.
  • Every microservice should expose 1. a participate method and 2. a rollback (compensation) method (for n transactions -> n compensations).
  • If one transaction in the chain fails, the rollbacks of all previously completed ones must be called.
  • Compensation must be idempotent, cannot abort, and must be retried until it succeeds.
  1. Orchestration Based:
    One coordinator drives the saga. The single-point-of-failure problem still exists.
  2. Choreography Based:
    Each local trxn publishes domain events that trigger local trxns in other services. Preferred over orchestration-based. (See the sketch after the countermeasures below.)
- SAGA -> ACD ✔️, I ❌: it does not guarantee Isolation.
Countermeasures:
* Versioning data (check the version before updating)
* Re-reading data (before modifying it)
* Semantic locking (a lock flag that prevents other trxns from accessing the record)
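
As a sketch of the transaction/compensation pairing (a simplified, orchestration-style executor with hypothetical names; a choreography-based saga would achieve the same via domain events):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Each participant exposes a local transaction and a matching compensation.
interface SagaStep {
    void participate();   // local transaction
    void compensate();    // rollback; must be idempotent and retried until it succeeds
}

class SagaExecutor {
    void run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        try {
            for (SagaStep step : steps) {
                step.participate();
                completed.push(step);          // remember what has to be compensated
            }
        } catch (RuntimeException failure) {
            // One step failed: compensate everything that already ran, in reverse order.
            while (!completed.isEmpty()) {
                SagaStep done = completed.pop();
                retryUntilSuccess(done::compensate);
            }
        }
    }

    private void retryUntilSuccess(Runnable compensation) {
        while (true) {
            try { compensation.run(); return; }
            catch (RuntimeException e) { /* back off and retry; compensation cannot abort */ }
        }
    }
}
```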

CQRS (Command and Query Responsibility Segregation) Pattern

  • Queries (read)
  • Commands (update)

CQRS separates read and update operations to maximize performance. It is especially suitable for systems where reads greatly outnumber writes and where slightly stale data is acceptable.

Separate read and write databases -> in CAP terms, consistency is relaxed (AP ✔️, C ❌).

CQRS Implementation — https://betterprogramming.pub/cqrs-software-architecture-pattern-the-good-the-bad-and-the-ugly-e9d6e7a34daf
- complex
- database dependent; requires expertise in a variety of database technologies.
- using a large number of databases means more points of failure.
- need to handle data consistency in terms of CAP (the read model is only eventually consistent with the write model).
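
To make the split concrete, here is a minimal sketch of the two sides (a hypothetical order example; the read model is usually kept up to date asynchronously from the write side, e.g. via events):

```java
// Command side: handles updates, enforces invariants, writes to the write database.
interface OrderCommandService {
    void placeOrder(String orderId, String productId, int quantity);
    void cancelOrder(String orderId);
}

// Query side: serves reads from a separate, denormalized read database.
interface OrderQueryService {
    OrderSummary getOrderSummary(String orderId);
}

// Denormalized view optimized for reads; may be slightly stale.
record OrderSummary(String orderId, String status, int itemCount) {}
```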

Event Sourcing

  • All changes to app state are stored as a sequence of events (a journal of domain events), which makes rollback easy (e.g. the Axon framework).
  • Commonly combined with the CQRS pattern.
  • Defines an approach where all the changes that are made to an object or entity (also known as the write model) are stored as a sequence of immutable events in an event store, as opposed to storing just the current state.
- suitable for event-driven architecture 
- makes it possible to reliably publish events whenever state changes
- audit log, easy to get state at a given time
- the event store is difficult to query directly (complex queries are inefficient), which is why it is usually paired with CQRS read models
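
A minimal, framework-free sketch of the idea (a hypothetical bank-account example): the current state is never stored directly, it is rebuilt by replaying the immutable events.

```java
import java.util.ArrayList;
import java.util.List;

// Immutable domain events, named in the past tense.
sealed interface AccountEvent permits MoneyDeposited, MoneyWithdrawn {}
record MoneyDeposited(long amount) implements AccountEvent {}
record MoneyWithdrawn(long amount) implements AccountEvent {}

class Account {
    private final List<AccountEvent> journal = new ArrayList<>(); // the event store (append-only)
    private long balance;                                         // derived state

    public void deposit(long amount) { apply(new MoneyDeposited(amount)); }

    public void withdraw(long amount) {
        if (amount > balance) throw new IllegalStateException("insufficient funds");
        apply(new MoneyWithdrawn(amount));
    }

    private void apply(AccountEvent event) {
        journal.add(event);   // the event, not the state, is what gets persisted
        mutate(event);
    }

    private void mutate(AccountEvent event) {
        if (event instanceof MoneyDeposited d) balance += d.amount();
        else if (event instanceof MoneyWithdrawn w) balance -= w.amount();
    }

    // The state at any point in time can be rebuilt by replaying the journal.
    static Account replay(List<AccountEvent> events) {
        Account account = new Account();
        events.forEach(e -> { account.journal.add(e); account.mutate(e); });
        return account;
    }
}
```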

Transactional Outbox Pattern

A microservice often has to update its database and publish events within the same logical operation; the two should be atomic.
What happens in case of a system failure between them? => Solution: write the event to an outbox table in the same local transaction (since the outbox is also just a database table).
A scanning batch job (message relay) checks whether any unprocessed record exists (e.g. via a “status” column that is set to PROCESSED once published), reads it from there and then publishes the related event.
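
A sketch of the two halves with plain JDBC (table and column names are assumptions): the business write and the outbox insert share one local transaction; a separate relay polls the outbox and publishes.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class OrderService {
    // Business write and outbox insert are committed atomically in ONE local transaction.
    void placeOrder(Connection conn, String orderId, String payloadJson) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement insertOrder = conn.prepareStatement(
                 "INSERT INTO orders (id, status) VALUES (?, 'PLACED')");
             PreparedStatement insertOutbox = conn.prepareStatement(
                 "INSERT INTO outbox (aggregate_id, event_type, payload, status) VALUES (?, 'OrderPlaced', ?, 'UNPROCESSED')")) {
            insertOrder.setString(1, orderId);
            insertOrder.executeUpdate();
            insertOutbox.setString(1, orderId);
            insertOutbox.setString(2, payloadJson);
            insertOutbox.executeUpdate();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}

interface MessagePublisher { void publish(String eventType, String payload); }

class OutboxRelay {
    // Scanning job: read unprocessed rows, publish them, then mark them processed.
    // Publishing and marking are not atomic, so consumers must be idempotent (at-least-once delivery).
    void relayOnce(Connection conn, MessagePublisher publisher) throws SQLException {
        try (PreparedStatement select = conn.prepareStatement(
                 "SELECT id, event_type, payload FROM outbox WHERE status = 'UNPROCESSED'");
             ResultSet rows = select.executeQuery()) {
            while (rows.next()) {
                publisher.publish(rows.getString("event_type"), rows.getString("payload"));
                try (PreparedStatement mark = conn.prepareStatement(
                         "UPDATE outbox SET status = 'PROCESSED' WHERE id = ?")) {
                    mark.setLong(1, rows.getLong("id"));
                    mark.executeUpdate();
                }
            }
        }
    }
}
```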

Change Data Capture (CDC)

  • It captures change events from a database log (or equivalent), allowing application state to be externalized, and forwards them to consumers (e.g. Debezium).
  • Types: Change Columns (such as DATE_MODIFIED; not good at capturing deletes), Trigger-Based (performance issues), Log-Based (most preferred)

Log-Based Change Data Capture

  • Directly integrated with the database:
    MySQL, Postgres, etc. -> transaction log (write-ahead log) ->
    Debezium -> Kafka topics
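
On the consuming side, the change events that Debezium writes to Kafka can be read with a plain Kafka consumer. A minimal sketch (the broker address, group id and topic name below are assumptions; Debezium typically publishes one topic per table):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ChangeEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-projection");          // assumed group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("shopdb.inventory.orders"));             // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a change event describing the row's state transition.
                    System.out.println(record.value());
                }
            }
        }
    }
}
```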

Domain events — An explicit event, part of your business domain, that is generated by your application. These events are usually represented in the past tense, such as OrderPlaced, or ItemShipped. These events are the primary concern for Event Sourcing.

Change events — Events that are generated from a database transaction log indicating what state transition has occurred. These events are of concern for Change Data Capture.

Event Sourcing vs. Change Data Capture: Event Sourcing uses its own event journal as the source of truth, while Change Data Capture relies on the underlying database’s transaction log as the source of truth.

Comparison of Attributes — https://debezium.io/blog/2020/02/10/event-sourcing-vs-cdc/

Listen to Yourself Pattern

  • Consume the messages you yourself produced (the db save comes later). => If the message was successfully published to the message broker, you assume the system will reach a consistent state.
  • All your events and database writes must be idempotent (as in every microservice regardless of the pattern you are using).
Listen to Yourself Pattern — https://medium.com/@odedia/listen-to-yourself-design-pattern-for-event-driven-microservices-16f97e3ed066
- The consistency risk is higher, since the db write happens later.
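
A framework-free sketch of the flow (all names here are hypothetical): the request handler only publishes; the same service subscribes to its own topic, and the database write happens in the listener.

```java
// Hypothetical broker abstraction for the sketch.
interface EventBus {
    void publish(String topic, String payload);
    void subscribe(String topic, java.util.function.Consumer<String> handler);
}

interface OrderRepository { void save(String orderJson); } // must be idempotent

class ListenToYourselfOrderService {
    private final EventBus bus;
    private final OrderRepository repository;

    ListenToYourselfOrderService(EventBus bus, OrderRepository repository) {
        this.bus = bus;
        this.repository = repository;
        // The service listens to the very topic it publishes to.
        bus.subscribe("orders", this::onOrderEvent);
    }

    // Step 1: handle the request by publishing only; no database write here.
    void placeOrder(String orderJson) {
        bus.publish("orders", orderJson);
        // If publish succeeded, we assume the system will converge to a consistent state.
    }

    // Step 2: the database write happens when we consume our own event.
    private void onOrderEvent(String orderJson) {
        repository.save(orderJson); // idempotent, so redelivery is safe
    }
}
```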

Dead Letter Queue Pattern

  • Data (in the form of a message from a queue) cannot be processed due to:
    - the lack of access to some component of the system, or
    - invalid or corrupt input data.
  • Solution => move the message to a “dead-letter queue”.
    Design your system so that it not only limits the number of re-processing attempts, but also postpones the re-processing of failed messages (a parking-lot queue), allowing other (temporarily unavailable) parts of the system to become available again.
Parking Lot and Dead Letter Queues — https://www.ibm.com/cloud/architecture/architectures/event-driven-deadletter-queue-pattern/
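
A sketch of the routing decision (the retry limit and queue abstraction are assumptions; real brokers have their own retry/DLQ mechanics): a few immediate retries, the parking-lot queue for messages that should wait, and the dead-letter queue for messages that will never succeed.

```java
// Hypothetical abstractions for the sketch.
interface Queue { void send(String message, int attempt); }

class FailedMessageRouter {
    private static final int MAX_ATTEMPTS = 3;   // assumed retry limit

    private final Queue mainQueue;
    private final Queue parkingLotQueue;  // delayed re-processing while a dependency is down
    private final Queue deadLetterQueue;  // invalid/corrupt messages that will never succeed

    FailedMessageRouter(Queue main, Queue parkingLot, Queue deadLetter) {
        this.mainQueue = main;
        this.parkingLotQueue = parkingLot;
        this.deadLetterQueue = deadLetter;
    }

    void onProcessingFailure(String message, int attempt, boolean invalidPayload) {
        if (invalidPayload) {
            deadLetterQueue.send(message, attempt);   // no point in retrying bad data
        } else if (attempt < MAX_ATTEMPTS) {
            mainQueue.send(message, attempt + 1);     // limited immediate retries
        } else {
            parkingLotQueue.send(message, attempt);   // postpone until the dependency recovers
        }
    }
}
```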

Distributed System Models

Capture assumptions in a system model consisting of:

  • Network behaviour (e.g. message loss)
  • Node behaviour (e.g. crashes)
  • Timing behaviour (e.g. latency)

The Two Generals Problem

https://www.cl.cam.ac.uk/teaching/2122/ConcDisSys/dist-sys-notes.pdf
  • Messages may get lost.
  • Neither party can know whether its msgs were received without getting an answer from the other. => This results in an infinite chain of acknowledgement msgs to reach certainty.

As a microservice example:

  • If the payment succeeds but there are not enough goods in the shop => you can send an apology e-mail and refund the card.
  • If there are enough goods in the shop but the shop did not get any information about the payment’s outcome => ask the payment service whether it actually charged that card.
  • With the steps above, the online shop works around the Two Generals Problem.
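
A sketch of that reconciliation step (hypothetical payment-service client and names): the charge is tied to the order id, so after a lost reply the shop can always ask “did you charge for this order?” instead of guessing.

```java
// Hypothetical client; charges are keyed by order id so the question is always answerable.
interface PaymentClient {
    PaymentStatus charge(String orderId, long amountCents); // idempotent per orderId
    PaymentStatus statusOf(String orderId);
}

enum PaymentStatus { CHARGED, NOT_CHARGED, UNKNOWN }

class DispatchService {
    private final PaymentClient payments;

    DispatchService(PaymentClient payments) { this.payments = payments; }

    boolean shouldDispatch(String orderId, boolean goodsInStock) {
        PaymentStatus status = payments.statusOf(orderId);   // reconcile instead of assuming
        if (status == PaymentStatus.CHARGED && !goodsInStock) {
            // Paid but out of stock: apologize and refund (compensation), do not dispatch.
            return false;
        }
        return status == PaymentStatus.CHARGED && goodsInStock;
    }
}
```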

The Byzantine Generals Problem


Assumptions:
1. Messaging is reliable (messengers do not get captured and msgs get received).
2. Some of the generals are unreliable (malicious) and may collude.

Aim: We want honest generals to succeed.

  • Up to f generals might behave maliciously.
  • Honest generals don’t know who the malicious ones are.
  • The malicious generals may collude.
  • Nevertheless, honest generals must agree on plan.

Theorem: you need 3f + 1 generals in total to tolerate f malicious generals
(i.e. fewer than 1/3 may be malicious). Cryptography (digital signatures) helps, but the problem remains hard.

As a microservice example, see the figure in the same lecture notes: https://www.cl.cam.ac.uk/teaching/2122/ConcDisSys/dist-sys-notes.pdf

Happy Coding!
