← Back to Blog

The Transaction That Wasn't: How Distributed Systems Lie About Consistency

· 5 min read
distributed systems databases consistency transactions postgresql mysql redis architecture backend engineering debugging

The Transaction That Wasn't: How Distributed Systems Lie About Consistency

Last Tuesday, I got a message at 2 AM. The kind you dread. "Payments are processing twice. Money is being charged to people twice."

Not ideal.

The code looked correct. It felt correct. We had a transaction wrapper, we were using PostgreSQL, the query had proper WHERE clauses. Everything that should be atomic was wrapped in BEGIN...COMMIT. Standard stuff.

But money was duplicating. Orders were ghost-shipping. Inventory counts were drifting off into la-la land.

After six hours of staring at query logs, connection states, and one very aggressive coffee habit, I found the culprit. And it wasn't in the application code. It wasn't even really in the database.

It was in the space between our application and our database. In the layer of connections that nobody thinks about until it breaks.

This is the story of distributed consistency, why ACID doesn't mean what you think it means, and what actually happens when you have "transactions" across multiple systems.

The Myth of the Atomic Operation

Let's start with something basic. You've got this:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE user_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE user_id = 2;
INSERT INTO transactions (from_user, to_user, amount) VALUES (1, 2, 100);
COMMIT;

This is a transaction. It should be atomic. The textbook says it's atomic. The PostgreSQL documentation says it's atomic.

The weird part is: it is atomic. Inside one PostgreSQL session.

The problem is that nobody operates "inside one PostgreSQL session."

You've got connection pooling. You've got read replicas. You've got Redis caching. You've got application servers scaling horizontally. You've got multiple processes that might be handling the same logical operation.

And in that environment, "atomic" becomes a much more complicated word.

What Connection Pooling Does to Your "Transactions"

Here's where most engineers get blindsided. You're using PgBouncer. Or HikariCP. Or some other connection pooler. These are great. They let you run 100 application threads while sharing 20 database connections.

But they introduce a subtlety that breaks assumptions.

When you do this:

with db.transaction():
    process_payment(order_id)
    update_inventory(order_id)
    send_confirmation(order_id)

You're assuming you're getting one session, one transaction, one atomic block. But under the hood, you're getting:

  1. Connection pool gives you connection A from the pool
  2. You begin transaction on connection A
  3. Some network blip causes a retry on a different connection
  4. Your "transaction" now spans two sessions
  5. PostgreSQL doesn't know they're related
  6. Both commits happen independently
  7. Your "atomic" operation just became two operations

Let me be concrete. Here's what actually happens in many production systems:

# Your code
def process_order(order_id):
    with connection_pool.acquire() as conn:
        with conn.transaction():
            # This might internally retry on a different connection
            # if the first one has issues
            result = execute_query(conn, order_id)

The issue is that connection poolers often silently reconnect. And when they do, your transaction boundaries don't follow you.

The Red Herring: SERIALIZABLE Doesn't Save You

Here's what surprises people: even SERIALIZABLE isolation level doesn't fix this. Because SERIALIZABLE only governs what happens within database operations. It knows nothing about your application's retry logic, your connection pool's reconnection behavior, or your multiple-service architecture.

You can set:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- your operations
COMMIT;

And PostgreSQL will faithfully serialize them. But if your application logic silently retries part of your operation on a new connection, you've already lost.

The Read Replica Problem

Here's another layer of the same problem. You're scaling reads to replicas:

# Load balancer routes reads to replicas
def get_order(order_id):
    return replica.query("SELECT * FROM orders WHERE id = ?", order_id)

def update_order(order_id, data):
    return primary.execute("UPDATE orders SET ... WHERE id = ?", order_id)

Classic CQRS (Command Query Responsibility Segregation). Reads go to replicas, writes go to primary.

Except now you've got temporal inconsistency baked into your architecture. You write to primary at T=0. Your replica might not have that write until T=5 (or T=500 if things are weird).

Someone reads their order status immediately after an update. They hit the replica. It shows the old data. Your application sees inconsistency and either:

  1. Shows stale data (confusing for users)
  2. Retries the operation (potentially duplicating work)
  3. Fails with a mysterious error (confusing for everyone)

This isn't a bug. This is intended behavior. Your database is functioning correctly. Your architecture just has an assumption baked in that data is immediately consistent everywhere, and that assumption is wrong.

Redis Caching: Where Your Database Integrity Goes to Die

Now let's add Redis into the mix, which is where things get really fun.

Standard caching pattern:

def get_user(user_id):
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user

Looks fine, right? Cache-aside pattern. Classic.

But now you've got the same data in two places. PostgreSQL and Redis. And they're not in a transaction together. They can't be.

What happens when:

  1. User updates their profile
  2. Database updates successfully
  3. Redis write fails
  4. Next read returns stale data
  5. User sees their old profile
  6. They update again
  7. Now you're corrupting user state

Or worse:

  1. User updates their profile
  2. Database update succeeds
  3. Redis delete succeeds
  4. But now there's a gap where no cache exists
  5. Load spikes hit the database
  6. Database starts choking
  7. Requests start timing out
  8. Your monitoring alerts you with cryptic "database connection pool exhausted" messages

The cache invalidation problem is real, but the consistency problem is worse. You now have two sources of truth, and no transaction in existence can span them.

What Actually Broke in My Payment System

Let me tell you exactly what happened with that payment system.

We had:

async def process_payment(order_id: str, amount: Decimal):
    async with db.transaction():
        # Deduct from payer
        await db.execute("""
            UPDATE accounts 
            SET balance = balance - %s 
            WHERE user_id = %s AND balance >= %s
        """, [amount, order_id, amount])
        
        # Add to payee
        await db.execute("""
            UPDATE accounts 
            SET balance = balance + %s 
            WHERE user_id = %s
        """, [amount, order_id])
        
        # Record transaction
        await db.execute("""
            INSERT INTO transactions (order_id, amount, status)
            VALUES (%s, %s, 'completed')
        """, [order_id, amount])

This looks atomic. It should be atomic. But we were using an async connection pool (asyncpg), and we had a pattern like:

async def execute_with_retry(query, params, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await pool.execute(query, params)
        except OperationalError as e:
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(0.1 * (2 ** attempt))  # exponential backoff

Here's the bug: if the connection died mid-transaction, our retry logic would silently re-execute the query on a new connection. PostgreSQL had already committed the original transaction on the dead connection. We'd retry it on a fresh connection. The retry would start a new transaction.

So balance - 100 executes twice. On two different connections. PostgreSQL sees two independent transactions. Both succeed. User gets charged twice.

The database did exactly what we told it to do. The bug was in our retry logic, which didn't understand that it was re-executing part of an already-committed transaction.

The Fixes That Actually Work

1. Idempotency Keys

The correct solution isn't "make your transactions atomic across systems." That's impossible. The correct solution is "make your operations idempotent."

async def process_payment(order_id: str, amount: Decimal, idempotency_key: str):
    # Check if we've already processed this
    existing = await db.query("""
        SELECT * FROM processed_payments 
        WHERE idempotency_key = %s
    """, [idempotency_key])
    
    if existing:
        return existing  # Already done, return original result
    
    async with db.transaction():
        # All the payment logic
        await deduct_balance(...)
        await add_balance(...)
        await record_transaction(...)
        
        # Record with idempotency key
        await db.execute("""
            INSERT INTO processed_payments (idempotency_key, order_id, amount, status)
            VALUES (%s, %s, %s, 'completed')
        """, [idempotency_key, order_id, amount])

Now even if your code runs twice, you get the same result. The database becomes the source of truth for "has this operation completed?"

2. Explicit Connection Pinning for Transactions

If you need true transaction semantics across multiple operations, pin to a single connection for the duration:

async def atomic_operation():
    # Acquire exclusive connection for this entire transaction
    async with pool.acquire() as conn:
        async with conn.transaction():
            await conn.execute("UPDATE ...", ...)
            await conn.execute("UPDATE ...", ...)
            await conn.execute("INSERT ...", ...)
            # All on same connection, truly atomic

No connection switching mid-transaction. No retry logic that spans connection boundaries.

3. Saga Pattern for Distributed Operations

When you genuinely need atomicity across multiple services (payment + inventory + fulfillment), you need a saga:

async def place_order(order):
    try:
        await reserve_inventory(order)
        await charge_payment(order)
        await initiate_fulfillment(order)
    except PaymentFailed:
        await release_inventory(order)  # Compensating action
        raise
    except FulfillmentFailed:
        await refund_payment(order)  # Compensating action
        await release_inventory(order)
        raise

No magic "all or nothing" transaction. Instead, explicit compensating actions that rollback partial work.

The Mental Model You Need

Here's the shift you have to make:

Stop thinking about transactions as "atomic units." Start thinking about them as "consistency boundaries."

A consistency boundary is a region where your data is guaranteed to be in sync. Outside that boundary, assume eventual consistency at best.

Your PostgreSQL transaction is a consistency boundary. It works. Redis is outside that boundary. Your replicas are outside that boundary. Your other microservices are way outside that boundary.

When you understand this, you stop trying to make impossible things work ("let's make Redis and PostgreSQL atomic together") and start building systems that tolerate inconsistency:

  • Idempotent operations everywhere
  • Explicit conflict resolution
  • Compensating transactions
  • Optimistic locking with version checks
  • Read-your-own-writes patterns (if you need stronger guarantees, read from where you wrote)

The Real Lesson

The payment bug I described wasn't a PostgreSQL bug. It wasn't an asyncpg bug. It wasn't even really a code bug.

It was a mental model bug.

We assumed "transaction" meant "one logical operation that either fully succeeds or fully fails." But in a distributed system, that's only true within a single, pinned database connection. The moment you introduce connection pooling, retries, caches, or multiple services, that assumption shatters.

The systems we're building are fundamentally eventually consistent. Not because they're broken, but because that's the trade-off that makes scale possible. The database that guarantees strong consistency across data centers is slow. The cache that gives you microsecond reads sacrifices consistency.

Understanding this trade-off, and designing your application to match reality rather than fighting it, is what separates systems that fail spectacularly in production from ones that degrade gracefully.

Your ACID transaction is real and useful. Just understand exactly what it guarantees, and what it doesn't.

Now if you'll excuse me, I need to go add idempotency keys to some code I thought was correct six months ago.


The database doesn't lie. But it doesn't tell the whole truth either.