Multi-tenancy patterns that don't blow up at scale

The Three Isolation Models

There are exactly three ways to separate tenants, and every multi-tenant architecture is a variation or combination of them. Pretending there's a fourth is how you end up with a custom abstraction nobody understands.

Pool: all tenants share the same tables, same database, same compute. Isolation is logical — a tenant_id column and discipline. This is the cheapest per tenant by an order of magnitude. A pooled Postgres instance can hold 50,000 tenants where siloed would need 50,000 databases.

Silo: each tenant gets dedicated infrastructure — separate database, sometimes separate compute. Isolation is physical. Blast radius is one tenant. Cost is brutal: you're paying for idle capacity on every tenant that isn't actively hammering the system.

Bridge: shared compute, separate schemas or separate databases on shared servers. The middle ground. Data is separated per tenant, logically with a schema or more physically with a database, while the expensive application tier stays shared.

The mistake is treating this as a one-time decision. The companies that survive scale run all three simultaneously, segmented by tenant tier. Your free tier is pooled. Your enterprise tier with a contractual data-isolation clause is siloed. The middle is bridge. If your architecture can't express that, you'll be rewriting it at Series B.

Row-Level Security Is a Loaded Gun

Pooled multi-tenancy lives or dies on one question: what happens when a query forgets the WHERE tenant_id = ? clause. The answer must never be "it returns another tenant's data." That's not a bug, it's a breach, and it's the single most common catastrophic failure in SaaS.

Application-level filtering, meaning you rely on every ORM call to include the tenant predicate, is not a control. It's a hope. One raw SQL query, one missing scope in a background job, one junior engineer copy-pasting a query, and you've leaked. I've seen it happen at a payments company. The postmortem was not fun.

Postgres Row-Level Security moves the predicate into the database. You set app.current_tenant as a session variable and define a policy:

ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON invoices
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

Now the database refuses to return rows that don't match, regardless of what the application query says. The filter is unforgeable from the application layer.

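The loaded-gun part: superusers and roles with the BYPASSRLS attribute always bypass RLS, and a table's owner bypasses it too unless you run ALTER TABLE ... FORCE ROW LEVEL SECURITY. If your application connects as the same role that owns the tables (commonly the migration user), the policy never fires and you've built a fence with a gate that's always open. The role your application and connection pooler use at request time must be a non-owner, non-superuser role without BYPASSRLS; migration and admin tooling that legitimately needs wider access should use separate, tightly held credentials. Test this explicitly. Write an integration test that connects as the application role, sets a tenant context, queries, and asserts it cannot see another tenant's seeded data. Run it in CI. Expect modest overhead from policy evaluation, roughly 2–5% for a simple equality policy like this one, though policies that contain subqueries can cost far more, so measure on your own workload. That's cheap insurance.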
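The bypass rules are easy to get wrong in review, so it can help to write them down as a predicate and audit your roles against it. The following is a minimal sketch, not a replacement for the integration test: the `Role` class and `bypasses_rls` function are hypothetical names, the flags mirror the `rolsuper` and `rolbypassrls` columns of `pg_roles`, and `force_rls` stands for the table's `relforcerowsecurity` flag in `pg_class`. It deliberately ignores role membership, which Postgres also considers when deciding ownership.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """Hypothetical snapshot of one database role's attributes."""
    name: str
    is_superuser: bool = False  # pg_roles.rolsuper
    bypass_rls: bool = False    # pg_roles.rolbypassrls


def bypasses_rls(role: Role, table_owner: str, force_rls: bool) -> bool:
    """Return True if this role would skip the table's RLS policies.

    Simplification: ownership through role membership is not modeled.
    """
    if role.is_superuser or role.bypass_rls:
        # These attributes bypass RLS even when FORCE is set on the table.
        return True
    if role.name == table_owner and not force_rls:
        # Owners are exempt unless FORCE ROW LEVEL SECURITY is enabled.
        return True
    return False
```

Run a check like this against every role that appears in a connection string; any role for which it returns True should not be serving tenant traffic.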

Tenant Context Must Be Unforgeable

The tenant identity has to be derived from something the client cannot manipulate. If a tenant ID arrives in a request header or body and you trust it, any authenticated user can pivot to any tenant by changing a number.

Derive tenant context from the authenticated principal. The JWT, the session, the API key — these are signed or server-controlled. The mapping from principal to tenant lives server-side. A user in tenant A presents a token; your auth middleware resolves that token to tenant A and sets the database session variable. The user never names their own tenant.

The failure mode here is subtle in systems that support tenant switching — consultants, agencies, admin panels. The moment you allow "act as tenant X," you need an authorization check that this principal is permitted to act as X, and that check must be server-side and audited. Most cross-tenant leaks in mature systems happen here, not in the data layer. The RLS holds; the context-setting logic gets fooled.
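As a rough illustration of the "act as tenant X" check, here is a sketch in which the tenant is always derived from the authenticated principal and a switch is allowed only if a server-side grant exists. Everything named here (`Principal`, `delegated_tenants`, `resolve_tenant`, the list-based audit log) is hypothetical; a real system would load grants from its own authorization store and write audits somewhere durable.

```python
from dataclasses import dataclass


class TenantAccessDenied(Exception):
    """Raised when a principal asks to act as a tenant it has no grant for."""


@dataclass(frozen=True)
class Principal:
    """Server-side view of an authenticated caller (built from a verified token)."""
    user_id: str
    home_tenant: str
    # Tenants this principal may act as; maintained server-side, never client-supplied.
    delegated_tenants: frozenset = frozenset()


def resolve_tenant(principal, act_as=None, audit_log=None):
    """Return the tenant this request runs as, or raise TenantAccessDenied."""
    if act_as is None or act_as == principal.home_tenant:
        return principal.home_tenant
    allowed = act_as in principal.delegated_tenants
    if audit_log is not None:
        # Record every switch attempt, including denied ones.
        audit_log.append((principal.user_id, act_as, allowed))
    if not allowed:
        raise TenantAccessDenied(f"{principal.user_id} may not act as {act_as}")
    return act_as
```

The requested tenant is only ever an input to an authorization decision; it is never used as the tenant context directly.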

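Make the tenant context a single chokepoint. One middleware, one function that sets app.current_tenant. Every request flows through it. Background jobs serialize the tenant ID into the job payload and re-establish context on pickup. If a job runs without tenant context, it should crash loudly, not fall back to a default or empty tenant. With the policy above, an unset app.current_tenant makes current_setting raise an error and a NULL value matches no rows, which is the fail-closed behavior you want. Never add a "no tenant means all rows" branch to the policy for convenience.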
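One way to sketch the chokepoint in application code is a context variable with no default, so that any code path that never established a tenant fails instead of guessing. The names below (`tenant_context`, `current_tenant`, `enqueue`, `run_job`) are illustrative assumptions, not part of any framework, and the job "queue" is just a dictionary payload.

```python
import contextvars
from contextlib import contextmanager

# No default on purpose: reading it without setting it must fail.
_current_tenant = contextvars.ContextVar("current_tenant")


class MissingTenantContext(RuntimeError):
    """Raised when code runs without an established tenant."""


@contextmanager
def tenant_context(tenant_id):
    """The single place where tenant context is established."""
    if not tenant_id:
        raise MissingTenantContext("refusing to run without a tenant id")
    token = _current_tenant.set(tenant_id)
    try:
        yield tenant_id
    finally:
        _current_tenant.reset(token)


def current_tenant():
    """Return the active tenant id, or raise if none was established."""
    try:
        return _current_tenant.get()
    except LookupError:
        raise MissingTenantContext("no tenant context established") from None


def enqueue(job_name, tenant_id, **args):
    """Serialize the tenant id into the job payload."""
    return {"job": job_name, "tenant_id": tenant_id, "args": args}


def run_job(payload, handler):
    """Re-establish tenant context on pickup; a missing id crashes the job."""
    with tenant_context(payload.get("tenant_id")):
        return handler(**payload["args"])
```

The database layer would read `current_tenant()` when it sets app.current_tenant, so both requests and jobs pass through the same function.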

Noisy Neighbors and Resource Fences

Pooled tenants share CPU, memory, IOPS, and connection slots. One tenant running a report that scans 40 million rows degrades latency for everyone. This is the defining operational pain of pooled multi-tenancy, and it's invisible until one tenant 10xs their usage.

You need per-tenant resource fences. At the application layer, that's per-tenant rate limiting and concurrency caps — not global limits. A token bucket keyed by tenant ID prevents one customer from consuming your entire request budget. Set the cap based on their tier, enforce it before the request hits the database.
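A token bucket keyed by tenant ID might look like the sketch below. It is in-process only, which is an assumption made for brevity; with more than one application instance the bucket state would need to live in a shared store. The class name, the tier table, and the injectable clock are all hypothetical.

```python
import time


class TenantRateLimiter:
    """Per-tenant token buckets; limits come from the tenant's tier."""

    def __init__(self, tier_limits, clock=time.monotonic):
        # tier_limits maps tier name -> (burst capacity, tokens refilled per second)
        self._tier_limits = tier_limits
        self._clock = clock
        self._buckets = {}  # tenant_id -> (tokens, time of last update)

    def allow(self, tenant_id, tier):
        """Consume one token for this tenant; return False if the bucket is empty."""
        capacity, rate = self._tier_limits[tier]
        now = self._clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(tenant_id, (capacity, now))
        # Refill for the elapsed time, never beyond capacity.
        tokens = min(capacity, tokens + (now - last) * rate)
        if tokens < 1:
            self._buckets[tenant_id] = (tokens, now)
            return False
        self._buckets[tenant_id] = (tokens - 1, now)
        return True
```

Because each tenant has its own bucket, one tenant exhausting its budget has no effect on whether another tenant's request is admitted.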

At the database layer, the blunt instrument is a statement timeout. Set statement_timeout aggressively — 5 to 10 seconds for OLTP paths — so a runaway query gets killed rather than holding locks and connections. Tenants that legitimately need long-running analytics get routed to a read replica or a separate analytics store, never the primary.

The deeper fix is workload separation. Heavy tenants graduate out of the pool. Build the plumbing to migrate a single tenant from the shared database to a dedicated one without downtime: dual-write, backfill, verify, cut over, drop the old rows. If migrating one tenant is a multi-day manual operation, you'll let noisy neighbors burn the pool down because the alternative is too painful. Make tenant migration a routine, scripted operation. That capability is worth more than any clever caching layer.

Connection Pools Are Where Silo Dies

Here's the trap nobody warns you about. You adopt the silo or bridge model with one database per tenant, feeling good about isolation. Then you discover that every Postgres connection is a backend process costing roughly 10MB, and a server tops out somewhere between a few hundred and a few thousand of them. With a connection pool per tenant database on a shared server, 2,000 tenants × a 10-connection pool = 20,000 connections your application is trying to hold, and that is per application instance. The database falls over.
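The arithmetic above is worth keeping as a small calculator, because the numbers get worse as soon as you run more than one application instance. This is a back-of-the-envelope sketch; the function names are made up for illustration, and the 10MB default is the rough per-connection figure quoted above, not a measured value.

```python
def total_connections(tenant_databases, pool_size, app_instances=1):
    """Connections held if every app instance keeps one pool per tenant database."""
    return tenant_databases * pool_size * app_instances


def connection_memory_mb(connections, mb_per_connection=10):
    """Rough server-side memory for that many backends (assumed ~10MB each)."""
    return connections * mb_per_connection


held = total_connections(tenant_databases=2000, pool_size=10)
print(held)                        # 20000
print(connection_memory_mb(held))  # 200000 MB, i.e. about 200 GB
```

With four application instances the same fleet would be trying to hold 80,000 connections.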

With database-per-tenant, connection count grows linearly with tenant count while each server's connection limit stays fixed, and people forget this until the connection storms start. PgBouncer in transaction-pooling mode helps, but transaction pooling breaks session-level features, including the session-variable approach to RLS: a plain SET persists on the server connection, which the pooler hands to a different client for the next transaction.

The workaround for RLS under transaction pooling is SET LOCAL inside the transaction, which scopes the variable to the transaction rather than the session:

BEGIN;
SET LOCAL app.current_tenant = '...';
-- queries
COMMIT;

This survives transaction pooling because the variable is set and used within the same transaction the pooler keeps stable. Get this wrong and you'll have tenant context bleeding between requests that reuse a pooled connection — which is exactly the cross-tenant leak you were trying to prevent. Test it under concurrency, not in isolation.
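In application code, the same idea can be wrapped in a helper so that no query runs outside a tenant-scoped transaction. The sketch below assumes a DB-API connection in non-autocommit mode with psycopg-style `%s` placeholders; `tenant_transaction` is a hypothetical name. It uses Postgres's `set_config(name, value, is_local)` function with `is_local` set to true, which has the same transaction scope as SET LOCAL but accepts the tenant ID as a bound parameter instead of requiring string interpolation.

```python
from contextlib import contextmanager


@contextmanager
def tenant_transaction(conn, tenant_id):
    """Run a block in one transaction with app.current_tenant scoped to it.

    Assumes `conn` is a DB-API connection in non-autocommit mode, so the
    first execute() implicitly opens the transaction.
    """
    cur = conn.cursor()
    try:
        # Third argument true = transaction-local, equivalent to SET LOCAL.
        cur.execute(
            "SELECT set_config('app.current_tenant', %s, true)",
            (str(tenant_id),),
        )
        yield cur
        conn.commit()
    except Exception:
        # On any failure, discard the transaction and its tenant setting.
        conn.rollback()
        raise
    finally:
        cur.close()
```

A caller would write `with tenant_transaction(conn, tenant_id) as cur:` and run its queries on `cur`; when the block exits, the commit or rollback also ends the scope of the setting.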

For pooled architecture, this is a non-issue: one database, one pool, SET LOCAL per transaction. This is a real reason pooled often wins operationally even when silo sounds safer. The connection math for silo gets ugly fast, and you end up needing a pooler-per-region and careful pool sizing that nobody wants to own.

Migrations Across Thousands of Tenants

Schema migrations are where the isolation model sends you the bill.

In pooled, you have one schema. A migration runs once. But a migration that takes a lock on a 200-million-row shared table locks it for every tenant simultaneously. You cannot run any ALTER TABLE that rewrites the table, such as most column type changes or, before Postgres 11, ADD COLUMN ... DEFAULT (newer versions still rewrite when the default is volatile). A rewrite locks the table for minutes and takes down all tenants at once. Every migration must be non-blocking: add nullable columns, backfill in batches, add constraints as NOT VALID and then run VALIDATE CONSTRAINT as a separate step, build indexes CONCURRENTLY. The discipline is higher, but you run it once.
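The "backfill in batches" step could be sketched as follows. The assumptions are loud ones: an integer primary key named `id`, a hypothetical `invoices.currency` column being backfilled, and a DB-API connection with psycopg-style placeholders. Each batch commits on its own so row locks are held only briefly.

```python
def batch_ranges(min_id, max_id, batch_size):
    """Yield inclusive (start, end) id ranges covering min_id..max_id."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


# Hypothetical statement; the IS NULL guard makes re-running a batch harmless.
BACKFILL_SQL = (
    "UPDATE invoices SET currency = 'USD' "
    "WHERE id BETWEEN %s AND %s AND currency IS NULL"
)


def backfill(conn, min_id, max_id, batch_size=10_000):
    """Apply the backfill one id range at a time, committing after each."""
    for start, end in batch_ranges(min_id, max_id, batch_size):
        cur = conn.cursor()
        cur.execute(BACKFILL_SQL, (start, end))
        cur.close()
        conn.commit()  # release row locks before the next batch
```

In practice you would also pause between batches or watch replication lag; that throttling is omitted here.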

In silo and schema-per-tenant bridge, each tenant has its own schema, so a migration must run N times. This sounds safer — you can canary the migration on internal tenants first, then roll forward in waves. It is safer for blast radius. But it introduces version skew: tenant 4,000's migration fails halfway and now your fleet has tenants on schema version 47 and tenants on version 48. Your application code must tolerate both versions simultaneously, because you can never migrate 10,000 schemas atomically.

That version-skew tolerance is the hidden tax of per-tenant schemas. Build a migration orchestrator that tracks per-tenant schema version, runs in controlled batches, retries failures, and reports the fleet's version distribution. Deploy application code that handles the union of the old and new schema before you start migrating, and clean up the compatibility code only after the fleet is fully converged. Skip this and one failed migration in the middle of the fleet wedges your deploy pipeline.
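A stripped-down version of such an orchestrator is sketched below. It is an illustration under assumptions: per-tenant versions are held in a plain dictionary rather than a durable table, `migrate_one` is a hypothetical callable that applies the migration to one tenant and raises on failure, and the rollout halts after any batch that still has failures once retries are exhausted.

```python
from collections import Counter


def _migrate_with_retry(tenant_id, target, migrate_one, max_attempts):
    """Try one tenant up to max_attempts times; return True on success."""
    for _ in range(max_attempts):
        try:
            migrate_one(tenant_id, target)
            return True
        except Exception:
            continue
    return False


def migrate_fleet(versions, target, migrate_one, batch_size=100, max_attempts=3):
    """Migrate tenants below `target` in batches; return tenants that failed.

    `versions` maps tenant_id -> schema version and is updated in place.
    """
    pending = [t for t, v in versions.items() if v < target]
    failed = []
    for start in range(0, len(pending), batch_size):
        batch_failed = []
        for tenant_id in pending[start:start + batch_size]:
            if _migrate_with_retry(tenant_id, target, migrate_one, max_attempts):
                versions[tenant_id] = target
            else:
                batch_failed.append(tenant_id)
        failed.extend(batch_failed)
        if batch_failed:
            break  # halt the rollout; later batches stay on the old version
    return failed


def version_distribution(versions):
    """Report how many tenants sit on each schema version."""
    return dict(Counter(versions.values()))
```

Halting on failure is one policy among several; the point is that the fleet's version distribution is always known, so the application can be held in its compatibility mode until every tenant reports the target version.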

Choosing and Mixing Models

Default to pooled. It's the cheapest, the connection math works, migrations run once, and modern RLS makes it safe if you're disciplined. The vast majority of SaaS companies never outgrow a well-built pooled architecture for their long tail of tenants.

Move to bridge — schema-per-tenant or database-per-tenant on shared compute — when you have a concrete driver: a compliance requirement for data residency, a contractual isolation clause, or a tenant whose data volume distorts shared indexes. Don't move because it feels more enterprise. The operational cost is real and it compounds.

Reserve silo for tenants who pay enough to fund their own infrastructure and demand it explicitly. A siloed tenant is a pricing tier, not an architecture default. Charge for it.

The pattern that survives scale is the tier-routed hybrid: a routing layer that, given a tenant ID, knows which model and which physical database serves that tenant. A lookup table — tenant ID to shard, schema, and isolation model — that your connection layer consults on every request. This indirection lets you start everyone in pool, promote heavy or sensitive tenants to bridge or silo individually, and rebalance shards without application changes. The routing layer is the single most important piece of infrastructure in a multi-tenant system, and it's the one teams build last, after they're already in pain.
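The lookup table at the heart of the router can be very small. The sketch below keeps it in memory, which is an assumption for illustration; in a real deployment it would live in a control-plane database with caching. The names `Isolation`, `Placement`, and `TenantRouter`, and the example connection strings, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Isolation(Enum):
    POOL = "pool"
    BRIDGE = "bridge"
    SILO = "silo"


@dataclass(frozen=True)
class Placement:
    """Where one tenant's data lives."""
    isolation: Isolation
    dsn: str     # which physical database serves the tenant
    schema: str  # schema within that database


class UnknownTenant(KeyError):
    """Raised when a tenant has no placement; never guess a default."""


class TenantRouter:
    def __init__(self):
        self._placements = {}

    def assign(self, tenant_id, placement):
        """Create or change a tenant's placement (used for promotion and rebalancing)."""
        self._placements[tenant_id] = placement

    def route(self, tenant_id):
        """Return the placement the connection layer should use for this tenant."""
        try:
            return self._placements[tenant_id]
        except KeyError:
            raise UnknownTenant(tenant_id) from None


router = TenantRouter()
shared = Placement(Isolation.POOL, "postgres://pool-1/app", "public")
router.assign("tenant-a", shared)
router.assign("tenant-b", shared)
# Promote one heavy tenant to its own database without touching the other.
router.assign("tenant-b", Placement(Isolation.SILO, "postgres://silo-b/app", "public"))
```

Application code only ever calls `route`, so moving a tenant between models changes one row in the table rather than any application logic.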

Build the router first. Make tenant migration between models a scripted operation. Enforce isolation at the database, not the application. Derive tenant context from the principal. Fence resources per tenant. Do those five things and your multi-tenancy survives the growth that kills everyone else's.