Distributed Commit & Replication

Concept. When one transaction touches several shards, two-phase commit makes it atomic across all of them: a coordinator first asks every shard to prepare, then tells them all to commit only if every shard voted yes. Replication keeps copies in sync, and the choice of synchronous or asynchronous sets whether a replica is guaranteed current or merely close.

Intuition. Locking (2PL) keeps one machine's transaction correct, but it says nothing about whether three machines agree to commit together. Two-phase commit adds that agreement. Replication then decides the price of safety: wait for the copy before you say "done" (no data lost, slower), or say "done" now and let the copy catch up (fast, but a crash can lose the newest writes).

A transfer moves money from an account on Shard 1 to an account on Shard 2. Each shard can lock its own row and log its own change, but neither can see the other. If Shard 1 commits and Shard 2 crashes, the money vanishes. Single-node atomicity is not enough; the shards need a protocol to commit or abort as one.

Two-Phase Commit

Phase one: a coordinator asks every shard to prepare and each logs its change and votes yes; phase two: the coordinator tells every shard to commit and each acknowledges, so the transaction is atomic across shards.

Figure 1. Two-phase commit turns many shards into one atomic decision. In the prepare phase the coordinator asks each shard whether it can commit; a shard logs its change durably and votes yes, or votes no. The coordinator commits only if every vote is yes, logs that decision, then in the commit phase tells all shards to apply it. One no anywhere forces a global abort, so the transaction is all-or-nothing across machines.

Two-phase locking and two-phase commit are different tools, and the names mislead. 2PL orders conflicting operations within one node so the schedule is serializable. 2PC gets separate nodes to agree on a single commit-or-abort outcome. A multi-shard transaction needs both: 2PL on each shard for isolation, 2PC across shards for atomicity.

The blocking problem

The coordinator is a single point of failure at the worst moment. If it crashes after the shards prepare but before it sends the decision, each prepared shard is stuck. It voted yes, so it cannot unilaterally abort; it has no decision, so it cannot commit. It holds its locks and waits for the coordinator to come back. Other transactions that need those rows wait too.

Why 2PC alone is fragile

A prepared shard that loses contact with the coordinator blocks holding locks until the coordinator recovers. Production systems remove that single point of failure by running the commit decision through a replicated consensus group (Raft or Paxos), so a new coordinator can be elected and finish the protocol. Google Spanner commits this way.

Synchronous vs Asynchronous Replication

Replication keeps copies of a shard on other machines. The open question is when the primary tells the client a write is safe: after the replica confirms it, or right away.

Synchronous replication waits for the replica to acknowledge before replying OK to the client, losing nothing on failover but adding latency; asynchronous replication replies OK immediately and ships to the replica later, which is fast but can lose the un-shipped tail if the primary fails.

Figure 2. The two modes trade latency against durability. Synchronous replication forwards each write to the replica and waits for its acknowledgement before replying OK, so a failover to the replica loses no committed write, at the cost of a full round trip per write. Asynchronous replication replies OK as soon as the primary has the write and ships it to the replica in the background, which is fast but means a primary crash can lose any writes that had not yet shipped. The gap between primary and replica is replication lag.

Reading from a replica is stale, not wrong

Read replicas soak up read traffic, but an asynchronous replica is always a little behind. A query there sees a consistent snapshot (MVCC still gives it one self-consistent view), just an older one, as of whatever the replica has applied. The practical trap is read-your-writes: you write to the primary, immediately read from a replica, and your own write is not there yet because it has not replicated. The data is consistent, simply not current.

Picking a mode

Need zero data loss on failover (payments, the system of record)? Replicate synchronously and pay the latency. Serving analytics or feeds where a few seconds behind is fine? Replicate asynchronously and read from replicas. Most large systems do both: synchronous within a region for durability, asynchronous across regions for reach.

Common Confusions

Q: Is two-phase commit the same as two-phase locking?
No. 2PL is per-node concurrency control for serializability; 2PC is a cross-node agreement on commit or abort. A multi-shard transaction uses both.

Q: Does 2PC need a quorum like Raft?
Plain 2PC needs every participant to vote, so one slow or dead shard stalls it. Pairing the coordinator with a consensus group fixes coordinator failure, not participant failure.

Q: Can a read replica ever return inconsistent data?
Under MVCC it returns a consistent snapshot, just a stale one. What you do not get from an async replica is read-your-writes or the very latest commit.