The Global Consistency Nightmare: Why Distributed Transactions Are Dead and CRDTs Are the Only Viable Path for Web Scale
December 28, 2025



Distributed transactions are the architectural debt incurred when demanding synchronous ACID properties across a globally partitioned state. Latency is not the enemy; blocking latency is, and Two-Phase Commit (2PC) ensures maximal blocking.

For systems operating beyond a single geographical region, or those requiring sub-100ms response times under high write contention, the attempt to enforce global serializability through locking protocols is not merely inefficient—it is an act of self-sabotage that yields a fundamentally unscalable system.

The Iron Coffin of Two-Phase Commit (2PC)

2PC, and its slightly more paranoid sibling 3PC, attempts to fake atomicity across independent services (participants) coordinated by a central service (the coordinator). This architecture is a direct trade-off: you gain atomicity across boundaries, but you pay a brutal tax in availability, latency, and fault tolerance.

Why 2PC Must Fail at Scale

Imagine a typical transaction: reserving inventory, charging a payment service, and updating a loyalty ledger. Three participants, one coordinator.

  1. Phase 1: Prepare/Voting (O(N) synchronous calls): The coordinator asks all N participants if they are ready to commit. Every participant must lock the necessary resources, write the transaction intent to a durable log, and signal 'Yes.' If even one service fails here, the entire transaction rolls back.
  2. Phase 2: Commit/Rollback (O(N) synchronous calls): If all participants voted 'Yes,' the coordinator issues the global commit signal. All participants must then finalize the transaction and release their locks.
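
The two phases above can be sketched as a coordinator loop. This is a minimal illustration, not a production protocol; the Participant interface and its method names are hypothetical:

```typescript
// Hypothetical participant interface; method names are illustrative only.
interface Participant {
    prepare(txId: string): Promise<"yes" | "no">; // lock resources, log intent
    commit(txId: string): Promise<void>;          // finalize and release locks
    rollback(txId: string): Promise<void>;        // undo and release locks
}

// Minimal 2PC coordinator: both phases are O(N) synchronous barriers.
async function twoPhaseCommit(txId: string, participants: Participant[]): Promise<boolean> {
    // Phase 1: every participant must vote; one "no" (or a timeout) aborts all.
    const votes = await Promise.all(participants.map(p => p.prepare(txId)));
    if (votes.some(v => v !== "yes")) {
        await Promise.all(participants.map(p => p.rollback(txId)));
        return false;
    }
    // Phase 2: the point of no return. If the coordinator dies here,
    // participants block, holding their locks until it recovers.
    await Promise.all(participants.map(p => p.commit(txId)));
    return true;
}
```

Note that the locks acquired in prepare() stay held across both await barriers — exactly the blocking window the rest of this section is about.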

The performance implications are horrifying:

  • Lock Contention: Resources are held hostage for the duration of two full network round trips plus synchronous disk I/O for logging the intent and the final decision.
  • The Single Point of Failure: If the coordinator fails after issuing the 'Prepare' but before issuing 'Commit,' participants are left in an indeterminate state. They must wait for the coordinator to recover (or for human intervention) because the locks cannot be released—releasing them would violate atomicity. This guarantees blocking latency and severely limits availability ($A \approx 0$ under coordinator failure).
  • The Network Split Trap: A partial network failure isolating the coordinator from a participant can lead to deadlock or inconsistent resolution, often requiring tedious manual reconciliation.

In short, 2PC transforms the complexity of distributed state into the complexity of managing distributed locks and consensus, effectively reducing the throughput of $N$ services to that of the slowest participant, gated by the failure characteristics of the coordinator.

The Mathematical Escape Hatch: CRDTs

If global ACID is the requirement, 2PC is the painful answer. But what if we fundamentally challenge the requirement? Instead of demanding that operations must always execute in a total order (serializability), we accept that they will happen concurrently and asynchronously, but we demand that the resulting state must always converge—no matter the order of operations or network partitions.

This is the promise of Conflict-Free Replicated Data Types (CRDTs).

CRDTs are data structures whose merge operation, by design, possesses three key mathematical properties:

  1. Commutativity: $A \circ B = B \circ A$. The order in which operations (or merges) are applied does not matter.
  2. Associativity: $(A \circ B) \circ C = A \circ (B \circ C)$. Merging groups of states yields the same result.
  3. Idempotency: $A \circ A = A$. Applying the same merge twice (e.g., after a redelivered message) changes nothing.
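
The Grow-Only Set (G-Set), the simplest CRDT, makes these laws concrete: its merge is plain set union, which satisfies all of them. A minimal sketch:

```typescript
// G-Set: a grow-only set whose merge is set union.
type GSet<T> = Set<T>;

// Union is commutative, associative, and idempotent, so any merge
// order over the same updates converges to the same state.
function mergeGSet<T>(a: GSet<T>, b: GSet<T>): GSet<T> {
    return new Set([...a, ...b]);
}
```

Merging {1, 2} with {2, 3} in either order yields {1, 2, 3}, and re-merging a set with itself leaves it unchanged.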

Because these properties hold, two replicas that start in the same state and receive the same set of updates (even out of order) are mathematically guaranteed to converge to the identical, correct state without requiring locking, consensus protocols, or transaction rollbacks.

Operation-Based vs. State-Based CRDTs

There are two main families:

  • State-Based (CvRDTs): Replicas exchange their full current state. Merging is a join (a least upper bound), such as a set union or an element-wise maximum. Excellent for low-churn data whose state is small enough that shipping it whole doesn't strain the network.
  • Operation-Based (CmRDTs): Replicas exchange operations (small deltas) rather than the entire state. These operations must carry sufficient context (like unique IDs or Vector Clocks) to be applied remotely. Generally preferred for high-volume, collaborative applications.
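
The "sufficient context" an op-based CRDT needs can be as little as a unique operation ID, so a replica can discard redelivered operations before applying them. A minimal sketch (names hypothetical):

```typescript
// A delivered operation carries enough context to be applied exactly once.
interface CounterOp {
    opId: string;  // globally unique, e.g. `${nodeId}:${seq}`
    delta: number; // +1 or -1
}

class OpBasedCounter {
    private seen = new Set<string>(); // dedup against network redelivery
    private total = 0;

    apply(op: CounterOp): void {
        if (this.seen.has(op.opId)) return; // idempotent delivery
        this.seen.add(op.opId);
        this.total += op.delta;
    }

    value(): number {
        return this.total;
    }
}
```

Because addition is commutative, replicas that receive the same ops in any order, with duplicates, converge to the same total.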

Let’s examine a classic example: The Positive-Negative Counter (PN-Counter).

If you replicate a simple integer counter across nodes, concurrent increments overwrite one another on merge and updates are silently lost. The PN-Counter solves this by partitioning the state into two sets of counters (one for increments, one for decrements), indexed by the node ID. The total value is the sum of all positive counters minus the sum of all negative counters.

Production Code: The PN-Counter Merge

This implementation focuses on how state convergence works without locks. Each node maintains its own list of increments ($P$) and decrements ($N$).

// Internal representation of a PN-Counter
type NodeID = string;

interface PNCounter {
    // P tracks increments by node ID
    P: Record<NodeID, number>; 
    // N tracks decrements by node ID
    N: Record<NodeID, number>;
}

// Operation: Increment on Node 'A'
function increment(counter: PNCounter, node: NodeID) {
    counter.P[node] = (counter.P[node] || 0) + 1;
}

// Operation: Decrement on Node 'A'
function decrement(counter: PNCounter, node: NodeID) {
    counter.N[node] = (counter.N[node] || 0) + 1;
}

// Operation: The Merge function (the core of the CRDT)
function mergePNCounters(local: PNCounter, remote: PNCounter): PNCounter {
    const mergedP: Record<NodeID, number> = { ...local.P };
    const mergedN: Record<NodeID, number> = { ...local.N };

    // Merge P (positive increments)
    for (const remoteNode in remote.P) {
        const localMax = local.P[remoteNode] || 0;
        mergedP[remoteNode] = Math.max(localMax, remote.P[remoteNode]);
    }

    // Merge N (negative decrements)
    for (const remoteNode in remote.N) {
        const localMax = local.N[remoteNode] || 0;
        mergedN[remoteNode] = Math.max(localMax, remote.N[remoteNode]);
    }

    return { P: mergedP, N: mergedN };
}

// Calculate the final value
function value(counter: PNCounter): number {
    const sumP = Object.values(counter.P).reduce((a, b) => a + b, 0);
    const sumN = Object.values(counter.N).reduce((a, b) => a + b, 0);
    return sumP - sumN;
}

The magic is in the Math.max() merge strategy. Because each per-node entry in P and N only ever increases, simply taking the maximum value seen by any replica guarantees that no updates are lost, regardless of the order in which they arrive. This merge function is inherently commutative, associative, and idempotent.

This model is critical for scenarios like social media likes/upvotes, shared inventory tracking (where slight over-commitment is an acceptable latency trade-off), or collaborative text editors.

The CRDT Gotchas: The Trade-Offs You Accept

CRDTs are not a silver bullet. They trade locking complexity for state complexity. Before adopting them, developers must be acutely aware of the following production realities:

1. State Explosion and Compaction Debt

Many common CRDTs, particularly set-based ones like the Grow-Only Set (G-Set) or the internal state of a PN-Counter, only ever grow. They accumulate state to guarantee that no information is lost.

If you use an Observed-Remove Set (OR-Set) to model a shopping cart, every item that is ever removed (deleted) is not truly erased. Instead, a tombstone is created—a record that indicates the item was removed at a specific version/time. These tombstones must persist until all replicas have acknowledged the removal, preventing the item from spontaneously reappearing if an older update arrives later.
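
The tombstone mechanism can be sketched as follows. This is a simplified OR-Set, and deliberately never garbage-collects its tombstones, to illustrate the growth problem described below:

```typescript
// OR-Set: each add gets a unique tag; remove tombstones the tags it observed.
class ORSet<T> {
    private adds = new Map<string, T>();    // tag -> element
    private tombstones = new Set<string>(); // removed tags (never erased here)
    private seq = 0;

    constructor(private nodeId: string) {}

    add(element: T): void {
        // A fresh tag per add lets a concurrent re-add survive a remove.
        this.adds.set(`${this.nodeId}:${++this.seq}`, element);
    }

    remove(element: T): void {
        // Tombstone only the tags this replica has observed for the element.
        for (const [tag, el] of this.adds) {
            if (el === element) this.tombstones.add(tag);
        }
    }

    has(element: T): boolean {
        for (const [tag, el] of this.adds) {
            if (el === element && !this.tombstones.has(tag)) return true;
        }
        return false;
    }

    merge(remote: ORSet<T>): void {
        for (const [tag, el] of remote.adds) this.adds.set(tag, el);
        for (const tag of remote.tombstones) this.tombstones.add(tag);
    }
}
```

A remove only kills the tags it saw, so a concurrent re-add (with a new, un-tombstoned tag) wins — which is exactly the "add-wins" semantics you want for a shopping cart.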

The Trap: If you don't implement aggressive and costly state compaction routines (garbage collection), your data structures will grow indefinitely, leading to massive memory pressure and slow synchronization times. Compaction often requires temporary global coordination, partially defeating the non-blocking promise.

2. Operational Visibility Blind Spots

In a synchronous system (like 2PC), a failure is immediate and catastrophic; you see it in the error logs. In an eventual consistency system, a failure manifests as divergence—nodes drift out of sync slowly and silently.

Standard APM tools optimized for request/response cycles are insufficient. You need custom monitoring for consistency verification:

  • Lag Metrics: Monitoring the time difference between when an operation is committed on Node A and when the resulting state converges on Node B.
  • Divergence Checkers: Regularly running background jobs that compute the hash of the state on two replicas and alert if they differ, signaling a broken merge function or a bug in delta propagation.
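
A divergence checker reduces to a canonical serialization plus a hash comparison. A minimal sketch for flat key/value state (production systems typically use Merkle trees to localize the divergent keys):

```typescript
import { createHash } from "node:crypto";

// Canonicalize state (sorted keys) so equal states always hash equally.
function stateDigest(state: Record<string, number>): string {
    const canonical = JSON.stringify(
        Object.keys(state).sort().map(k => [k, state[k]])
    );
    return createHash("sha256").update(canonical).digest("hex");
}

// Background job: compare digests from two replicas; false means drift.
function replicasConverged(a: Record<string, number>, b: Record<string, number>): boolean {
    return stateDigest(a) === stateDigest(b);
}
```

The sort step matters: two replicas holding identical state may have inserted keys in different orders, and naive JSON.stringify would report a false divergence.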

3. The Definition of Identity

In CRDTs, handling identity and uniqueness is significantly harder than in a relational database. Since you cannot rely on synchronous consensus to issue a globally unique ID (like a sequential primary key), you must use UUIDs or compound IDs (Node ID + Lamport Timestamp).

This makes seemingly simple operations, like maintaining an ordered list or enforcing uniqueness (e.g., unique username registration), much harder, often requiring specialized CRDTs like the LSEQ (List CRDT) or bespoke merge strategies for constraint enforcement.
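
A compound ID (Node ID + Lamport timestamp) can be sketched as follows; the clock fast-forwards on every received message so IDs remain consistent with causality:

```typescript
// Lamport clock: ticks on local events, fast-forwards on received messages.
class LamportClock {
    private time = 0;

    constructor(private nodeId: string) {}

    // Local event: advance the clock and mint a compound ID.
    // Unique because no node ever reuses one of its own timestamps.
    nextId(): string {
        this.time += 1;
        return `${this.time}:${this.nodeId}`;
    }

    // On receive: jump past the sender's timestamp to preserve causality.
    receive(remoteTime: number): void {
        this.time = Math.max(this.time, remoteTime) + 1;
    }
}

// Compound IDs sort by (time, nodeId), giving a total order without consensus.
function compareIds(a: string, b: string): number {
    const [ta, na] = a.split(":");
    const [tb, nb] = b.split(":");
    return Number(ta) - Number(tb) || na.localeCompare(nb);
}
```

The node ID acts as a deterministic tie-breaker for concurrent events with equal timestamps, so every replica agrees on the same ordering.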

Verdict: When to Ditch 2PC Forever

Stop pretending you can afford synchronous, global serializability. If your application falls into any of the following categories, CRDTs are a mandatory investment, not an option:

  • Collaborative Editing (e.g., Google Docs, Figma): Strong Eventual Consistency. Best fit: CRDTs (specifically LSEQ, RGA).
  • Global E-commerce Carts: Weak Eventual Consistency (additive only). Best fit: G-Set, OR-Set.
  • Gaming State/Leaderboards: Eventual Consistency. Best fit: PN-Counters, Max-Min Registers.
  • High-Velocity IoT Ingestion: Eventual Consistency (Mergeable Logs). Best fit: CRDTs / Append-Only Logs.

CRDTs allow us to decouple the write path (low latency, high availability) from the read path (eventual consistency verification). This decoupling is the only viable path to building systems that serve millions of concurrent users globally without incurring catastrophic latency penalties or operational debt that bankrupts the business.

Embrace eventual consistency not as a failure state, but as the only correct architectural choice for web scale.


Ahmed Ramadan

Full-Stack Developer & Tech Blogger
