Semantic caching in production: what pgvector actually gets you

Sometime around the third month of running Supergate in production, I noticed a pattern in our logs: roughly 40% of the prompts hitting our LLM gateway were near-duplicates. Not identical — users would rephrase, swap a synonym, reorder clauses — but the semantic intent was the same. We were paying for the same answer, over and over.

The obvious response was to build a cache. The less obvious question was: what does a "cache hit" mean when the key is a paragraph of natural language?

Why traditional caching falls apart

A Redis lookup on the raw prompt string misses the moment someone writes "summarize this document" instead of "give me a summary of this document". Hashing the prompt is fast and deterministic, but it has zero tolerance for paraphrase. In a multi-tenant gateway processing prompts from dozens of different applications, exact-match caching is almost useless. Our hit rate was under 3%.

Semantic caching flips the model: instead of matching on the literal string, you embed the prompt into a vector and search for neighbors within a distance threshold. If a cached prompt is close enough, you return the stored response and skip the upstream LLM call entirely.
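As a rough illustration of that flow, here is a minimal in-memory sketch. It assumes embeddings are already computed elsewhere, and the names `CacheEntry`, `cosineSimilarity`, and `lookup` are made up for this example; it is not how Supergate stores or searches entries.

```typescript
// Illustrative in-memory semantic cache. A production cache would use a
// vector index instead of a linear scan.
type CacheEntry = { embedding: number[]; response: string };

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Return the nearest cached response if it clears the threshold, otherwise null
// (a null result means the caller should go to the upstream LLM).
function lookup(
  entries: CacheEntry[],
  query: number[],
  threshold: number
): string | null {
  let best: CacheEntry | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, query);
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best !== null && bestScore >= threshold ? best.response : null;
}
```

A `null` from `lookup` is the miss path: call the model, then store the new embedding and response so the next paraphrase can hit.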

It sounds elegant. In practice, the interesting part is everything that can go wrong.

The distance threshold problem

The single most important parameter in a semantic cache is the similarity threshold. Set it too loose and you serve confident answers to questions the user did not ask. Set it too tight and your hit rate falls back toward exact-match levels. There is no universally correct value: it depends on your embedding model, your domain, and what "close enough" means for your users.

In Supergate, we started with a cosine similarity threshold of 0.95 and then tuned it per tenant. Some tenants run factual Q&A where a 0.92 threshold is fine, because the paraphrases genuinely map to the same answer. Others run creative generation where even 0.98 produces noticeable quality drift.

cache-lookup.ts

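// Assumes a Drizzle-style `db` client and `sql` template tag are in scope.
const findSemanticMatch = async (
  embedding: number[],
  tenantId: string,
  threshold: number
) => {
  // pgvector accepts a '[x,y,z]' text literal. Casting it explicitly keeps the
  // driver from binding the JS array as a Postgres array or a parameter list.
  const vector = `[${embedding.join(",")}]`;
  // <=> is cosine distance, so similarity = 1 - distance.
  const result = await db.execute(sql`
    SELECT response, 1 - (embedding <=> ${vector}::vector) AS similarity
    FROM cache_entries
    WHERE tenant_id = ${tenantId}
      AND 1 - (embedding <=> ${vector}::vector) >= ${threshold}
    ORDER BY embedding <=> ${vector}::vector
    LIMIT 1
  `);
  return result.rows[0] ?? null;
};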
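The per-tenant tuning described above needs somewhere to live. One possible shape, sketched here with hypothetical tenant IDs and a made-up `thresholdFor` helper, is a small lookup that falls back to the conservative default and lets a tenant opt out of caching entirely (modeled as `null`, which is an assumption of this sketch rather than something Supergate is known to do).

```typescript
// Conservative starting point for tenants with no tuned value.
const DEFAULT_THRESHOLD = 0.95;

// Hypothetical per-tenant settings. A number is a cosine similarity threshold;
// null means semantic caching is disabled for that tenant.
const tenantThresholds = new Map<string, number | null>([
  ["factual-qa-tenant", 0.92],
  ["creative-tenant", null],
]);

// Resolve the threshold to pass to the cache lookup, or null to skip the cache.
function thresholdFor(tenantId: string): number | null {
  if (!tenantThresholds.has(tenantId)) return DEFAULT_THRESHOLD;
  return tenantThresholds.get(tenantId) ?? null;
}
```

The caller would check for `null` first and only invoke `findSemanticMatch` when a numeric threshold comes back.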

Why pgvector, not Pinecone

When I started Supergate, the default advice was to use a dedicated vector database. I chose pgvector instead, and I'd make the same call again. Three reasons:

Operational simplicity. Our application already runs Postgres. Adding pgvector is an extension, not a service. No new deployment target, no new failure mode, no new billing dimension.

Transactional consistency. Cache writes, tenant metadata updates, and usage tracking all happen in the same transaction. With a separate vector database, you're either building two-phase commits or accepting eventual consistency on your cache — which is a polite way of saying "sometimes you serve the wrong tenant's cached response."

Good-enough performance. With an HNSW index, pgvector handles single-digit-millisecond lookups on tables up to a few million rows. That's the scale range where a semantic cache lives — you TTL entries aggressively, so the active set stays compact.
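For reference, the index and TTL sweep could look roughly like the statements below, kept as TypeScript string constants so they can be run from a migration or a scheduled job. The `created_at` column, the index name, and the 24-hour window are assumptions for illustration; `m = 16` and `ef_construction = 64` are pgvector's documented defaults, written out only to show where tuning would go.

```typescript
// Assumed schema: cache_entries(embedding vector, created_at timestamptz, ...).

// HNSW index with the cosine operator class, matching the <=> operator
// used in the lookup query.
const createHnswIndex = `
  CREATE INDEX IF NOT EXISTS cache_entries_embedding_hnsw
  ON cache_entries
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
`;

// Periodic sweep that enforces the TTL so the active set stays compact.
const expireOldEntries = `
  DELETE FROM cache_entries
  WHERE created_at < now() - interval '24 hours';
`;
```

The operator class has to match the distance operator in the query: an index built with `vector_l2_ops` would not be used for an `ORDER BY embedding <=> ...` lookup.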

"The best infrastructure decision is often the one you don't have to make. If your existing database can do the job, adding a new one is not simplicity — it's accidental complexity with a vendor sticker on it."

Cache poisoning and the model-version trap

The most insidious bug in a semantic cache isn't a wrong threshold. It's cache entries that were correct when they were written but are no longer valid because the upstream model changed.

When OpenAI ships a new GPT-4o snapshot, cached responses from the old version might subtly differ in formatting, reasoning depth, or factual accuracy. If your cache has no concept of model version, users randomly get old-model answers mixed in with new-model answers, with no visible explanation for why the "same" prompt sometimes gives different results.

We namespace every cache entry by a composite key: tenant ID + model identifier + model version. When a model version changes, the old cache entries naturally stop matching. It's a cold start on every model update, but it's a predictable cold start instead of a confusing quality regression.
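A minimal sketch of such a composite key follows. The `CacheNamespace` type and `namespaceKey` helper are hypothetical, and encoding the tuple as JSON is just one way to avoid delimiter collisions between fields; separate indexed columns would work equally well.

```typescript
// The three parts of the namespace described above.
type CacheNamespace = {
  tenantId: string;
  model: string;
  modelVersion: string;
};

// Build a single string key from the tuple. JSON-encoding the array means a
// delimiter character inside one field cannot make two different tuples collide.
function namespaceKey(ns: CacheNamespace): string {
  return JSON.stringify([ns.tenantId, ns.model, ns.modelVersion]);
}

// Illustrative values: the same tenant and model, two different snapshots.
const oldKey = namespaceKey({ tenantId: "t1", model: "gpt-4o", modelVersion: "2024-05-13" });
const newKey = namespaceKey({ tenantId: "t1", model: "gpt-4o", modelVersion: "2024-08-06" });
// oldKey !== newKey, so entries written under the old snapshot never match
// lookups made under the new one.
```

In the lookup query this would become one more equality filter next to `tenant_id`, so the vector search only ever sees entries written under the current model version.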

The real win isn't latency

When I pitch semantic caching to teams, they assume the value proposition is speed. It's not. A cache hit saves 500-2000ms of LLM latency, but the gateway already streams the first token in 200ms. Users don't notice the difference as much as you'd think.

The real win is cost. In a multi-tenant gateway, LLM API spend is the dominant operational cost. A 35-40% hit rate on a well-tuned semantic cache cuts that spend by roughly a third. At scale, that's the difference between a sustainable product and a slow bleed.
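The arithmetic behind that estimate is simple enough to write down. The sketch below assumes cached and uncached requests cost the same on average, and it adds an optional term for the embedding calls the cache itself needs; both the function name and the sample figures are illustrative, not Supergate numbers.

```typescript
// Estimated upstream spend once the cache is in place.
//   baselineSpend:     what you would pay with no cache
//   hitRate:           fraction of requests served from cache, in [0, 1]
//   embeddingOverhead: extra spend on embedding every incoming prompt
function spendAfterCache(
  baselineSpend: number,
  hitRate: number,
  embeddingOverhead = 0
): number {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError("hitRate must be between 0 and 1");
  }
  return baselineSpend * (1 - hitRate) + embeddingOverhead;
}

// Illustrative only: a 35% hit rate on a 10,000-unit baseline.
const estimated = spendAfterCache(10_000, 0.35);
const savedFraction = 1 - estimated / 10_000; // about 0.35 when overhead is zero
```

If cache hits skew toward short, cheap prompts, the real saving will be smaller than the hit rate suggests, so it is worth measuring hit rate weighted by tokens rather than by request count.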

The second win is rate limit headroom. Upstream providers throttle by tokens per minute. Every cache hit is a request that doesn't count against the quota. During traffic spikes, the cache acts as a natural pressure valve: the most common prompts resolve locally while only the long tail hits the upstream API.

What I'd do differently

If I were building this again from scratch, I'd add cache confidence scoring. Not every hit is equally trustworthy — a 0.99 similarity is much safer than a 0.93. Exposing that confidence to the calling application lets teams make their own tradeoff: serve the cached response but log a low-confidence flag, or fall through to the upstream model and backfill the cache.
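To make that concrete, here is one way the decision could be expressed. This is a sketch of an idea, not an existing Supergate feature: the `CachePolicy` fields, the `decide` function, and the cutoffs in the usage lines are all hypothetical.

```typescript
// Per-caller policy for how to treat hits of varying similarity.
type CachePolicy = {
  threshold: number; // minimum similarity to count as a hit at all
  highConfidence: number; // at or above this, serve without a flag
  serveLowConfidence: boolean; // false: fall through upstream and backfill
};

type CacheHit = { response: string; similarity: number };

type Decision =
  | { action: "serve"; confidence: "high" | "low"; response: string }
  | { action: "fallthrough"; reason: "miss" | "low-confidence" };

// Turn a raw lookup result into an explicit serve-or-fallthrough decision.
function decide(hit: CacheHit | null, policy: CachePolicy): Decision {
  if (hit === null || hit.similarity < policy.threshold) {
    return { action: "fallthrough", reason: "miss" };
  }
  if (hit.similarity >= policy.highConfidence) {
    return { action: "serve", confidence: "high", response: hit.response };
  }
  return policy.serveLowConfidence
    ? { action: "serve", confidence: "low", response: hit.response }
    : { action: "fallthrough", reason: "low-confidence" };
}
```

A latency-sensitive caller might set `serveLowConfidence: true` and log the flag, while a correctness-sensitive one sets it to `false` and pays for the upstream call whenever the match is only marginal.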

That's the whole story. Semantic caching is not a research problem anymore. The moving parts are well understood. The hard part is the discipline of tuning thresholds per domain, versioning cache entries correctly, and resisting the temptation to cache everything. Like most infrastructure, the boring decisions are the ones that matter.