Postgres as your AI memory layer: pgvector in production

Why Postgres, Not a Vector DB

Every AI memory layer ends up doing three things: store embeddings, retrieve them by similarity, and filter them by metadata. The marketing for Pinecone, Weaviate, and Qdrant focuses entirely on the first two and quietly ignores the third. In production, the third is where you spend your time.

We ship AI agents that need persistent memory across sessions, user-scoped recall, and tenant isolation. Every one of those constraints is a WHERE clause. A dedicated vector database treats metadata filtering as a second-class citizen bolted onto an ANN index. Postgres treats it as the entire point of its existence.

The other reason is operational gravity. You already run Postgres. You have backups, replicas, connection pooling, migrations, and an on-call team that knows how to read pg_stat_statements. Adding a vector database means a second consistency model, a second backup story, and a second thing that pages you at 3am. pgvector folds vectors into the database you already operate.

The trade-off is real: pgvector will not match a purpose-built ANN engine at billion-vector scale. But the crossover point where that matters is much higher than vendors imply. We run pgvector for workloads up to ~50M vectors per table with sub-100ms p95 recall. Below that scale, the dedicated database is solving a problem you don't have while creating three you didn't.

The Schema That Actually Scales

The naive schema is one table: id, content, embedding, metadata jsonb. It works for a demo and falls apart in production for two reasons — index bloat and filter cardinality.

Start with explicit columns for anything you filter on. Do not bury tenant_id, user_id, or created_at inside a jsonb blob. Postgres can index jsonb, but the planner's cardinality estimates on jsonb expressions are garbage, and you will get sequential scans on your hottest queries.

CREATE TABLE memory (
  id          bigint GENERATED ALWAYS AS IDENTITY,
  tenant_id   bigint NOT NULL,
  user_id     bigint NOT NULL,
  kind        text NOT NULL,
  content     text NOT NULL,
  embedding   vector(1536) NOT NULL,
  metadata    jsonb NOT NULL DEFAULT '{}',
  created_at  timestamptz NOT NULL DEFAULT now()
) PARTITION BY LIST (tenant_id);

Partition by tenant_id when you have multi-tenancy. This is the single highest-leverage decision in the whole design. A vector index over one tenant's 200K rows is dramatically faster and smaller than a shared index over 50M rows, and it lets the planner prune entire partitions before it ever touches an ANN index. It also makes per-tenant deletion a DROP TABLE instead of a vacuum-inducing bulk delete.

Keep the vector column at the dimensionality your embedding model actually emits. If you're on text-embedding-3-small at 1536 dims, store 1536. Do not store the full vector and a truncated copy unless you've measured that you need both. Each vector(1536) row is roughly 6KB, and that storage multiplies fast across indexes.

Store the content alongside the embedding. The temptation is to keep text in one table and vectors in another, joined by id. Don't. Retrieval almost always needs the content immediately, and the join adds a round trip and kills the planner's ability to fetch everything in one index scan.

HNSW vs IVFFlat: Pick Correctly

pgvector gives you two index types and they are not interchangeable.

IVFFlat partitions vectors into lists and searches the nearest few. It builds fast, uses less memory, and degrades gracefully. Its fatal flaw: you must build it after you have representative data, because the list centroids are computed from whatever's in the table at build time. Build it on an empty or unrepresentative table and recall collapses. The lists parameter should be roughly rows / 1000 for under a million rows, then sqrt(rows) beyond that.

HNSW builds a navigable graph. It has the best recall-to-latency curve, it doesn't need representative data to build, and it handles incremental inserts well. The cost is memory and build time — an HNSW index can take 10x longer to build and consumes far more RAM. The m (default 16) and ef_construction (default 64) parameters trade build time for recall.

Use HNSW unless you have a specific reason not to. The reason to fall back to IVFFlat is memory pressure: if your index doesn't fit in shared_buffers plus OS cache, every query hits disk and your p95 falls off a cliff. HNSW's graph is larger than IVFFlat's list structure, so on memory-constrained instances IVFFlat wins despite worse theoretical recall.

CREATE INDEX ON memory USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

Match the operator class to your distance metric. vector_cosine_ops for cosine, vector_l2_ops for Euclidean, vector_ip_ops for inner product. Using the wrong one doesn't error — it silently returns wrong results. If your embeddings are normalized (most modern models emit normalized vectors), inner product and cosine are equivalent and inner product is faster.

The query-time knob for HNSW is ef_search. Raise it to improve recall, lower it to cut latency. Set it per-query with SET LOCAL hnsw.ef_search = 100 inside a transaction. Default is 40, which is usually too low for production recall.

Filtering: The Part Everyone Gets Wrong

Here's the failure mode that surprises everyone. You write:

SELECT id, content FROM memory
WHERE tenant_id = 42 AND kind = 'note'
ORDER BY embedding <=> $1
LIMIT 10;

With an HNSW index, Postgres does the ANN search first, walking the graph to find candidates, then applies your WHERE filter to those candidates. If kind = 'note' is highly selective and most graph neighbors are some other kind, you can walk the entire graph and still return fewer than 10 rows. This is post-filtering and it produces incomplete results, not just slow ones.

pgvector's answer is iterative index scans, available in recent versions. Set hnsw.iterative_scan = 'relaxed_order' and the index keeps pulling candidates until it satisfies your LIMIT after filtering. This fixes correctness but costs latency on selective filters.

The better fix is to make your filter a partition prune or a cheap pre-filter. Partitioning by tenant_id means the tenant_id = 42 predicate eliminates partitions before the ANN scan runs — that's not filtering, that's targeting a smaller index. For lower-cardinality filters like kind, consider partial indexes:

CREATE INDEX ON memory USING hnsw (embedding vector_cosine_ops)
  WHERE kind = 'note';

Now queries filtering on kind = 'note' use a dedicated index containing only those rows, and the post-filtering problem disappears. The cost is index proliferation, so only do this for the handful of filter values that dominate your query patterns.

Measure your filter selectivity before choosing a strategy. A filter that keeps 80% of rows is fine with a single index. A filter that keeps 0.5% of rows will destroy ANN recall without partial indexes or partitioning.

Keeping Embeddings Fresh

Embeddings rot. Your model changes, your chunking strategy changes, or the source content updates. You need a reembedding path that doesn't take the table offline.

Track the model version on every row. Add an embedding_model text column. When you change models, you can't compare vectors across model versions — cosine similarity between text-embedding-3-small and text-embedding-3-large outputs is meaningless. So a migration means reembedding everything, and during that window you have a mixed table.

The pattern that works: add a new vector column for the new model, backfill it in batches with a separate worker, build the new index concurrently, then atomically switch queries over and drop the old column. CREATE INDEX CONCURRENTLY is mandatory here — a plain CREATE INDEX takes an ACCESS EXCLUSIVE lock and blocks all writes for the entire build, which on a 10M-row HNSW index is tens of minutes of downtime.

For incremental updates, don't reembed on every content edit synchronously. Mark the row dirty with an embedding_stale boolean flag and let a background worker batch the reembedding. Embedding API calls are slow (50-200ms each) and rate-limited; doing them in the request path couples your write latency to a third-party API.

Vacuum matters more than you'd think. HNSW indexes accumulate dead tuples on update-heavy tables, and a bloated graph degrades both recall and latency. Tune autovacuum_vacuum_scale_factor down for vector tables — the default 0.2 means vacuum only triggers after 20% of the table is dead, which is far too lazy for a hot memory table.

Performance Numbers From Production

Concrete numbers from a multi-tenant agent platform, so you can calibrate.

Table: 50M total rows across ~2000 tenant partitions, vector(1536), HNSW with m=16, ef_construction=64. Instance: 16 vCPU, 64GB RAM, gp3 storage. The per-tenant working set fits in RAM because of partitioning — that's the whole game.

With ef_search=40, p50 query latency is 8ms, p95 is 34ms, recall@10 against exact search is ~0.94. Bump ef_search to 100 and p95 climbs to 71ms while recall@10 reaches ~0.99. We run ef_search=80 as the default — recall ~0.98, p95 ~55ms. For agent memory, 0.98 recall is indistinguishable from perfect; the agent's downstream reasoning absorbs the occasional missed neighbor.

Index build on a single 200K-row partition takes about 4 seconds with HNSW. The full table never gets one index — that's the partitioning payoff.

Write throughput: inserts with HNSW index maintenance run ~3000 rows/sec on this instance, which is plenty for agent memory writes. If you're doing bulk ingest, drop the index, load, then rebuild concurrently — bulk insert with the index live is 5-10x slower.

Memory is the constraint you'll hit first. An HNSW index over 50M vectors at 1536 dims is roughly 300GB. You will not keep that hot. Partitioning means you only keep active tenants' indexes hot, and cold partitions page in on demand with a one-time latency hit. Monitor pg_statio_user_indexes for your index hit ratio — below 0.99 and you're going to disk too often.

When to Actually Leave

Be honest about the exit criteria so you don't leave too early or too late.

Leave when a single tenant exceeds tens of millions of vectors and you can't partition further. Partitioning saves you because each query touches a small index; once one logical scope is itself huge, you're back to scanning a giant graph and pgvector's single-threaded index scan becomes the bottleneck.

Leave when you need features Postgres doesn't have: sharded distributed ANN across nodes, sub-5ms p99 at massive scale, or built-in hybrid sparse-dense retrieval with learned fusion. pgvector plus pg_trgm or full-text gives you serviceable hybrid search, but it's manual reciprocal-rank-fusion in application code, not a first-class feature.

Leave when index build time blocks your operations. If reembedding your corpus means a 6-hour concurrent build that you run weekly, a database with faster distributed builds pays for itself.

Don't leave because a benchmark says Qdrant is 3x faster at 10ms. If your p95 is 55ms and your SLA is 200ms, that speed is irrelevant and you're trading operational simplicity for a number nobody feels. The right answer for most AI memory layers is the database you already run, tuned correctly. Reach for the dedicated engine when you've measured a real wall, not before.