I was debugging a production incident last month when I realized the on-call engineer had no idea what was actually happening in their system.
The symptom was simple: users were seeing stale data. Orders showed as "pending" when they were actually shipped. The payment service had processed the charge, the fulfillment service had picked the order, but the customer-facing API still thought nothing had happened. Classic eventual consistency problem, except nobody had built an "eventual" system—they'd built a "consistent" system in their heads and then built something else entirely in code.
This happens constantly. Teams draw little boxes on architecture diagrams, draw arrows between them, and call it a design. They talk about "the database" and "the API" and "the queue" as if these are well-defined, consistent things. Then production happens, and they wonder why their "simple" order processing flow sometimes takes three retries, leaves duplicate records, or loses data they swear they wrote.
Here's the uncomfortable truth: you're probably running a distributed system already. And if you think you're not, you're just running one without knowing it.
The "It's Not Distributed" Delusion
Let's talk about the mental model most backend developers operate under.
You've got a Node.js service. It connects to PostgreSQL. It exposes a REST API. The frontend calls it. Simple.
Except that service is now distributing state across at least three separate processes (your app, the database connection pool, the database itself). When you add Redis for caching, you're distributing state across four places. When you add a message queue for async processing, you're distributing state across five. When your frontend makes two sequential API calls and you have multiple server instances behind a load balancer, you're distributing state across N+5 places where N is your replica count.
The database isn't a single source of truth. It's a distributed system of replication lag, connection pooling, transaction isolation levels, and storage engine internals that your ORM definitely doesn't understand.
I tested this theory recently. I asked a team what consistency model their primary database operated under. They said "ACID." I asked which isolation level. They didn't know. I asked if they were using synchronous or asynchronous replication. They thought that was a question for the DBA, not the application developer.
That's when I knew they were running blind.
Where State Actually Lives
Here's what engineers consistently underestimate: state leaks into places you don't think about.
Your load balancer? It's a stateful system. That "stateless" request goes to Server A. The next request might go to Server B. But Server B doesn't have the user's session data in its memory. So now you're distributing that state into Redis, or sticky sessions, or JWT tokens with embedded claims. Every one of those is a distributed consistency problem you're pretending doesn't exist.
Your database connection pool? That's not one database—it's potentially dozens of underlying connections that your ORM manages with a pool. You're not talking to a database; you're talking to a pool that talks to a database that replicates to read replicas that might be milliseconds behind. Your application code thinks it's consistent because you wrote SELECT * FROM orders WHERE id = $1, but that read might hit a replica that's not caught up yet.
I watched a team burn two days trying to figure out why their "read-after-write" pattern wasn't working. They'd write to the primary, then immediately read from their read replica. Sometimes the data was there. Sometimes it wasn't. The answer was simple: replication lag. But they hadn't thought about replication lag because they'd never thought about the fact that they were running a distributed system.
The CAP Theorem Hates Your Assumptions
Here's where it gets painful.
The CAP theorem says you can have Consistency, Availability, and Partition tolerance—but you can only guarantee two of the three in a distributed system. Most engineers know this. Most engineers ignore it.
They build systems that claim to provide all three. They deploy databases that promise strong consistency. They write code that assumes transactions work across services. And then partition happens—not because they're running in a weird network topology, but because partitions always happen—and their system does something undefined.
Real production partitions aren't exotic. They're:
- A Kubernetes node losing network connectivity
- A cloud region having an availability event
- Your message queue becoming temporarily unavailable
- A database replica falling behind
When this happens, your system makes choices. Does it stay available but serve potentially stale data? Does it fail requests to maintain consistency? Does it retry forever and hope the partition heals?
Most systems make these choices by accident, not design. The default behavior of your message queue, your database driver, and your application framework combine to produce behavior that nobody intended and nobody understands.
I watched a system last year that would, during network partitions, silently drop messages. The queue would return success. The producer would move on. The consumer would never see the message. Nobody knew until customers started complaining about missing orders and the team discovered that about 0.1% of messages had simply vanished during a 10-minute network hiccup.
The Transaction That Wasn't
Here's the specific failure mode that breaks most "distributed" systems.
You have two services: an order service and an inventory service. User places an order. Order service creates the order. Inventory service decrements the count. Both need to succeed, or neither should happen.
Most implementations look like this:
# Order service
def create_order(user_id, product_id, quantity):
# Reserve inventory (call inventory service)
inventory_service.reserve(product_id, quantity)
# Create order
order = Order.create(user_id=user_id, product_id=product_id, quantity=quantity)
return order
This is not a transaction. This is a hope.
What happens if the inventory reservation succeeds but the order creation fails? You have reserved inventory that nobody will ever use. What happens if the order creates but the inventory call times out? You have an order with uncertain inventory state. What happens if the inventory service has a bug and throws an exception after reserving but before responding? You're in an unknown state.
The real answer requires something like two-phase commit, sagas, or at-least-once delivery with idempotency. But most teams don't implement any of this because they don't think they're building a distributed system. They're just "calling another service."
This is why microservices fail. Not because microservices are bad, but because teams treat them as if they're just calling functions across a network when they're actually building distributed systems that require fundamentally different consistency semantics.
The Consistency Spectrum Nobody Taught You
Strong consistency is not the only option. It's not even the best option for most systems.
There's a spectrum:
Linearizability: Every operation appears to happen atomically at a single point in time. This is what you get from a single-node database with synchronous writes. This is slow and not partition tolerant.
Sequential consistency: Operations appear to happen in some sequential order, but that order doesn't need to correspond to real time. This is what most single-node databases give you inside a single primary.
Causal consistency: Operations that are causally related must happen in order. Operations that aren't causally related can happen in any order. This is weaker but faster.
Eventual consistency: Given enough time without new updates, all replicas will converge to the same value. This is what you get with async replication. It's fast but you can read stale data.
Read-your-writes consistency: After you write something, you'll read it back. This sounds trivial but is surprisingly hard to provide in distributed systems. If you write to the primary and immediately read from a replica, the replica might not have the data yet.
Most applications need read-your-writes consistency. Most don't get it because they don't know to ask for it.
Here's a concrete example. User creates an account. User immediately tries to log in. Login looks up the user by email. If that read hits a replica, and replication hasn't caught up, the user doesn't exist yet. User gets "invalid credentials." User is confused. User resets password. User is now angry.
This happens. A lot. And it's not a bug in the application—it's the natural behavior of eventual consistency systems. The bug is in the assumption that it wouldn't happen.
What You Actually Need To Build
Here's the practical part.
If you're building any system with more than one service, more than one database, or more than one server, you need to:
Know your consistency model. What does your database guarantee? What does your message queue guarantee? What does your cache guarantee? These don't need to be strong, but they need to be known. Undefined consistency is the enemy.
Design for partial failures. Every network call can fail. Every service can be unavailable. Your system needs to handle this gracefully, not by retrying forever or by propagating errors you don't understand.
Use idempotency everywhere. In distributed systems, "did my write succeed?" is often unanswerable. The question you should ask instead is "can I safely retry this operation?" Idempotent operations can be retried without side effects. Build that.
Implement proper saga patterns for multi-service operations. If you have a business operation that spans multiple services, you need either:
- Two-phase commit (if you can live with its limitations)
- Saga pattern with compensating transactions
- Event-driven consistency with reconciliation
Measure your actual consistency guarantees. Use tools like Jepsen if you're serious. Or at least understand your replication topology and measure your actual replication lag in production.
Design for the partition. Partitions will happen. Your system needs to either:
- Stay available and accept that some operations will succeed with stale data
- Stay consistent and fail operations during partitions
- Gracefully degrade based on what's actually important
The System You're Already Running
Let me close with this.
The infrastructure you're already running is more complex than you think. That Kubernetes cluster? It's a distributed consensus system running on top of distributed storage, running on top of distributed networking. Those etcd nodes that manage your cluster state? They're running the Raft consensus algorithm to give you the illusion of a single, consistent control plane.
Your "simple" three-tier architecture is a distributed system. Your "stateless" microservices are distributed. Your "single source of truth" database is distributed across replicas, partitions, and storage engines.
The question isn't whether you're running a distributed system. The question is whether you're doing it on purpose.
Most engineers aren't. They're building distributed systems by accident and then being surprised when they behave like distributed systems.
The fix isn't to avoid complexity. It's to understand the complexity you're already carrying. Know what consistency means. Know where state lives. Know what happens when things break.
Because things will break. That's not pessimism, that's engineering. The question is whether your system degrades gracefully or catastrophically.
Make it degrade gracefully.