Vishal Vaibhav

When collision is good: semantic query caching with LSH

Wed, 27 May 2026 00:00:00 GMT

Everyone learns the same rule on day one: hash collisions are bad. Two inputs landing in the same bucket means wasted work — longer lookup chains, unpredictable performance, security headaches. The whole point of a good hash function is to scatter inputs as randomly and evenly as possible.

Locality-Sensitive Hashing (LSH) breaks this rule deliberately. The goal is to make similar inputs land in the same bucket, not different ones. Collisions are the product, not the defect.

This post explains why you'd want that, how MinHash makes it work, and how an LSH-backed cache can triple the effective capacity of a search cache without adding a single byte of hardware.

the problem with exact-match caching

A search cache is simple: hash the query string, look up the result. If a user has typed this exact query before, return the cached product list and skip the expensive retrieval pipeline.

The problem is that users don't type the same string twice. They type variations:

"nike running shoes"
"nike running shoe"
"running shoes nike"
"nike running sneakers"

Those four strings produce four different hash values, four separate cache entries, and four copies of essentially the same product list. At the scale of a large retailer — hundreds of millions of searches per day, billions of cached entries — this redundancy is not a rounding error. It's a significant fraction of your cache budget.

The question isn't whether this waste exists. It's whether we can remove it without introducing wrong cache hits.

exact-match cache
─────────────────
"nike running shoes"     →  [products]
"nike running shoe"      →  [products]
"running shoes nike"     →  [products]
"nike running sneakers"  →  [products]

   4 cache entries — same data, copied 4 times


LSH cache
─────────
"nike running shoes"     ╮
"nike running shoe"      │
                         ├──→  [products]
"running shoes nike"     │
"nike running sneakers"  ╯

   1 cache entry — shared across all 4 queries

jaccard similarity: a way to measure "same thing"

Before we can build a smarter cache, we need a way to measure whether two queries are saying the same thing.

Jaccard similarity is the simplest useful measure: divide the number of words the two queries share by the total number of unique words across both.

A = {"nike", "running", "shoes"}
B = {"nike", "running", "shoe"}

intersection = {"nike", "running"} → size 2
union        = {"nike", "running", "shoes", "shoe"} → size 4

Jaccard(A, B) = 2 / 4 = 0.5

Two identical queries score 1.0. Completely unrelated queries score 0.0. The score lives cleanly in [0, 1].

For the cases we care about, similar queries score 0.5–0.9. Genuinely different queries (different category, different brand, different intent) tend to score below 0.2.
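A minimal sketch of that calculation, assuming queries are lowercased and split on whitespace (a production tokenizer would likely do more). The function name is mine, not part of any library.

```python
def jaccard(query_a, query_b):
    """Jaccard similarity between the word sets of two queries."""
    a = set(query_a.lower().split())
    b = set(query_b.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty queries count as identical
    return len(a & b) / len(a | b)


print(jaccard("nike running shoes", "nike running shoe"))   # 0.5
print(jaccard("nike running shoes", "running shoes nike"))  # 1.0
print(jaccard("nike running shoes", "garden hose"))         # 0.0
```

Word order doesn't matter because the queries become sets, but "shoes" and "shoe" still count as different words.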

minhash: turning jaccard into a hash

Now the clever part. There's a family of hash functions — called MinHash — with a remarkable property:

For any two sets A and B, if you pick a random MinHash function h, then P(h(A) == h(B)) = Jaccard(A, B).

Read that again. The probability that two queries produce the same hash value equals their Jaccard similarity. If two queries are 80% similar, a random MinHash function will give them the same hash 80% of the time.

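This is the mathematical foundation that makes everything else work. The proof is short. MinHash applies a random ordering to all possible elements and keeps the smallest element of each set. The smallest element of A ∪ B is equally likely to be any member of the union, and the two sets share their minimum exactly when that element also lies in A ∩ B. That happens with probability |A ∩ B| / |A ∪ B|, which is the Jaccard similarity.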
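Here is a small sketch of the idea, not a production implementation. It assumes a seeded SHA-1 hash is a good enough stand-in for a random permutation, and the helper names (token_hash, minhash_signature, agreement) are invented for this example.

```python
import hashlib


def token_hash(seed, token):
    """One member of the hash family: each seed orders the tokens differently."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(query, num_hashes=36):
    """For each hash function, keep the smallest hash over the query's words."""
    tokens = set(query.lower().split())
    if not tokens:
        raise ValueError("query has no tokens")
    return [min(token_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def agreement(sig_a, sig_b):
    """Fraction of hash functions on which two signatures collide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_signature("nike running shoes", num_hashes=512)
b = minhash_signature("nike running shoe", num_hashes=512)
# An estimate of Jaccard(A, B) = 0.5; the exact value depends on the hashes.
print(agreement(a, b))
```

With 512 functions the agreement rate should land close to 0.5. With fewer functions the estimate gets noisier.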

votes: making the signal reliable

A single MinHash function is noisy. For a very similar pair with Jaccard 0.8 it disagrees 20% of the time, and for a dissimilar pair with Jaccard 0.2 it still agrees 20% of the time.

The fix is to run many hash functions and count agreements.

With 36 independent MinHash functions and a vote threshold of 18:

A query pair with Jaccard 0.8 agrees on ~29 out of 36 functions on average. Getting at least 18 agreements is almost certain.
A query pair with Jaccard 0.2 agrees on ~7 out of 36 functions on average. Getting at least 18 agreements is extremely unlikely.

The number of agreements follows a binomial distribution. With enough functions, the tails shrink and the two populations become cleanly separated. The vote count turns a noisy per-function signal into a reliable group decision.

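[Interactive chart: match probability versus Jaccard similarity, with adjustable number of hash functions and vote threshold]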

Two things worth noticing:

First, the S-curve crossover falls at the threshold ratio. With 18/36 votes (50%), the crossover is at Jaccard 0.5 — queries more than 50% similar get matched, queries less than 50% similar don't. Shift the threshold to 27/36 (75%) and the crossover shifts right.

Second, more hash functions means a steeper curve — a sharper boundary between "matched" and "not matched". Fewer functions gives a softer, fuzzier boundary. A common choice — 36 functions with a vote threshold of 18 — gives a curve steep enough to reliably separate similar from dissimilar while staying cheap to compute.
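The S-curve can be computed directly from the binomial distribution. A minimal sketch, assuming the hash functions are independent; match_probability is a name I made up for it.

```python
from math import comb


def match_probability(jaccard, num_hashes=36, min_votes=18):
    """P(at least min_votes of num_hashes MinHash functions agree)."""
    return sum(
        comb(num_hashes, k) * jaccard**k * (1 - jaccard) ** (num_hashes - k)
        for k in range(min_votes, num_hashes + 1)
    )


for s in (0.2, 0.4, 0.5, 0.6, 0.8):
    print(f"Jaccard {s:.1f} -> P(match) = {match_probability(s):.5f}")
```

At Jaccard 0.8 the probability is essentially 1, at 0.2 it is essentially 0, and at 0.5 it sits a little above one half. Raising num_hashes while keeping the same threshold ratio makes the transition between those extremes steeper.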

how the system is actually built

There are two distinct parts, running at very different timescales.

The offline cluster builder (nightly batch job)

Once a day, run a job over the past 30 days of query logs. For each of the top ~60M queries:

Compute all 36 MinHash values.
For each hash, record which bucket that query lands in.
Each hash function under which two queries share a bucket adds one to the edge weight between them.
Prune edges below a vote threshold (e.g., 20/36).
Find connected components — each component is a semantic cluster.
Pick a canonical query per cluster (simplest: most frequent query in the cluster).
Publish the mapping: canonical_query → which buckets it lives in.

The output is a static index: given any bucket ID, which canonical queries appear in it?
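The batch job above can be sketched in a few dozen lines. This is a toy version under stated assumptions: a seeded SHA-1 stands in for the MinHash family, buckets are keyed by (hash function index, hash value), and the pairwise loop inside each bucket is only workable for small inputs, not for 60M queries. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 36   # from the text
MIN_VOTES = 20    # the example pruning threshold from the text


def minhash_signature(query):
    tokens = set(query.lower().split())
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        )
        for seed in range(NUM_HASHES)
    ]


def build_clusters(query_counts):
    """query_counts maps each query to its frequency in the log window."""
    # Bucket every query under every hash function.
    buckets = defaultdict(list)
    for query in query_counts:
        for i, value in enumerate(minhash_signature(query)):
            buckets[(i, value)].append(query)

    # One vote per hash function under which a pair shares a bucket.
    edge_votes = defaultdict(int)
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            edge_votes[(a, b)] += 1

    # Keep only strong edges and merge their endpoints (union-find).
    parent = {q: q for q in query_counts}

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    for (a, b), votes in edge_votes.items():
        if votes >= MIN_VOTES:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for q in query_counts:
        clusters[find(q)].append(q)

    # The most frequent member of each cluster becomes its canonical query.
    return {
        max(members, key=query_counts.get): sorted(members)
        for members in clusters.values()
    }


logs = {"nike running shoes": 900, "running shoes nike": 40, "garden hose": 300}
print(build_clusters(logs))
```

The two Nike queries have identical word sets, so they agree on all 36 functions and end up in one cluster with the more frequent phrasing as canonical. "garden hose" shares no words with them and stays on its own.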

The online lookup (real-time, per request)

When a user query arrives:

Normalize (lowercase, trim whitespace).
Compute 36 MinHash values.
For each value, look up the canonical queries that appear in that bucket. Tally votes.
If the top-voted canonical query has ≥ 18 votes: it wins. Fetch its cached result.
If no winner: cache miss. Fall through to the full retrieval pipeline.

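from collections import Counter

# Assumed to be defined elsewhere: normalize(), HASH_FUNCTIONS (the 36 MinHash
# functions), bucket_index (bucket ID -> canonical queries), cache,
# MIN_VOTES (18) and the CACHE_MISS sentinel.

def get_cached_results(user_query):
    q = normalize(user_query)
    votes = Counter()

    for h in HASH_FUNCTIONS:          # 36 functions
        bucket = h(q)
        # A bucket the offline job never populated contributes no votes.
        for canonical in bucket_index.get(bucket, ()):
            votes[canonical] += 1

    if not votes:
        return CACHE_MISS

    # The canonical query with the most agreeing hash functions wins,
    # but only if it clears the vote threshold.
    winner, count = votes.most_common(1)[0]
    return cache.get(winner) if count >= MIN_VOTES else CACHE_MISS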

The 36 hash lookups can run in parallel. Each lookup is a hash table read against a compact in-memory index. At that point it's not doing search — it's doing arithmetic and array access.

why token weights matter

Plain Jaccard treats all words equally. That's not quite right for queries.

Consider:

"nike shoes" vs "adidas shoes" → Jaccard = 0.33, but these are different brand queries with different expected results
"nike shoes" vs "nike sneakers" → Jaccard = 0.33, but these almost certainly return the same products

A word like "shoes" carries more semantic meaning about the product category than a brand name. If we weight tokens by their importance — category words higher, brand names and modifiers lower — we get a similarity score that better tracks "would these two queries return the same results?"

This is weighted Jaccard. A typical implementation uses a tagger to label each token by its role — head noun, modifier, brand, and so on — and assigns weights to match. If your system already has a query tagger somewhere in the pipeline, reuse it. If not, a basic part-of-speech tagger that boosts nouns and discounts adjectives gets you 80% of the way there.

The math of weighted MinHash is slightly more involved (you weight the random permutation by token weight), but any decent library handles it — datasketch in Python, for instance. You pass in token weights, it gives you a MinHash. The rest of the system doesn't change.

the numbers

Cache capacity. If a cluster of 4–5 near-duplicate queries now shares one cache entry instead of four or five, and the average cluster size in your query log is 3–5 queries, you're storing 3–5x fewer entries for the same result coverage. In practice this lands around a 3x improvement in effective cache capacity.

Hit rate on tail queries. This is where the gains are biggest. Head queries (the top 1000 searches) already have high hit rates under exact-match caching because users type them verbatim repeatedly. Tail queries — rare, varied, one-off phrasings — are where the cache fails today. LSH clustering effectively "borrows" hits from the canonical query to cover all the tail variations. On long-tail traffic, reported gains run into the multiple-x range on F1 — often cited around 250%.

Latency. The cost is real. An exact-match cache lookup is one hash + one table read (~0.1 ms). LSH lookup adds 36 hashes + 36 table reads + a vote tally (~2 ms). That's a 20x increase in cache lookup overhead. The question is whether that 2 ms is acceptable given the p99 savings from serving more cache hits (each hit skips a 50 ms+ retrieval pipeline that a miss would have paid for). For most search SLOs, it is, but measure it before you commit.
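A back-of-the-envelope way to frame that trade-off, using the three latencies quoted above. The hit rates in the usage lines are invented for illustration, and this models mean latency only, not p99.

```python
EXACT_LOOKUP_MS = 0.1   # from the text
LSH_LOOKUP_MS = 2.0     # from the text
RETRIEVAL_MS = 50.0     # from the text ("50 ms+"), taken at its lower bound


def expected_latency_ms(lookup_ms, hit_rate, retrieval_ms=RETRIEVAL_MS):
    """Mean latency: every request pays the lookup, misses also pay retrieval."""
    return lookup_ms + (1.0 - hit_rate) * retrieval_ms


def break_even_hit_rate_gain():
    """Extra hit rate the LSH cache needs to pay for its slower lookup."""
    return (LSH_LOOKUP_MS - EXACT_LOOKUP_MS) / RETRIEVAL_MS


print(break_even_hit_rate_gain())                   # about 0.038
print(expected_latency_ms(EXACT_LOOKUP_MS, 0.30))   # hypothetical 30% hit rate
print(expected_latency_ms(LSH_LOOKUP_MS, 0.50))     # hypothetical 50% hit rate
```

Under these assumptions the LSH cache comes out ahead on mean latency once it lifts the hit rate by roughly four percentage points.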

what can go wrong

Wrong cache hits. If LSH assigns a user query to the wrong canonical, they get irrelevant results with no recovery path: the cache says "hit" but the results are wrong. The vote threshold is your main defense. Set it too low and false matches creep in; set it too high and genuine near-duplicates stop matching. An offline evaluation step (replaying query logs and comparing returned results against ground-truth retrieval output) is how you find the right threshold empirically before touching production.

Cluster staleness. Product catalogs change. A query cluster that was semantically coherent last month may not be today if a brand launches a new category or discontinues a product line. Nightly re-clustering handles the slow drift. You'll want a fast invalidation path — either a manual override or an automated signal from catalog change events — for sudden shifts.

Cold start for new queries. Any query that resembles nothing in the training window won't collect enough votes for any cluster. It's a cache miss, same as today. This is fine, since it's the same baseline behavior, but it means LSH doesn't help at all for genuinely novel queries. Those are also your most expensive queries (novel phrasing → harder retrieval), but that's a separate problem.

one breath

The insight is that hash collisions, normally a defect to engineer away, can be made into a feature. MinHash is designed so that the probability of a collision equals the Jaccard similarity between two sets. With enough hash functions and a vote threshold, the resulting match decision is reliable. Similar queries cluster together and share one cache entry; dissimilar queries don't. Cache capacity goes up, tail-query hit rate goes up, retrieval load goes down. You pay ~2 ms extra per lookup and accept that your cache is slightly fuzzy.

The math is worth understanding once. After that, the implementation is a library call and a batch job.

— v

The KV cache, from first principles

Sat, 16 May 2026 00:00:00 GMT

The number that determines your LLM inference bill doesn't appear on the model card. It isn't the parameter count. It isn't the context length. It's the KV cache: a per-request scratchpad in GPU memory that grows with every word the model reads or generates.

If you serve models, this is the dominant resource you're managing. Every recent inference trick exists to shrink it.

This post explains what the KV cache is from scratch, using a simple analogy.

what an LLM does, in one line

A language model takes the words you've written so far and predicts the next word. That's it. Chat, code generation, agents — everything is this one trick called in a loop.

input:  "The quick brown ___"
output: "fox" (95% likely)
        "dog" (3%)
        …

Pick one, append it, repeat. That's how an LLM writes a paragraph — one word at a time, each one a guess at what comes next.

words become numbers from a fixed list

Models don't see text — they see numbers. Before anything else, the model breaks your sentence into chunks from a fixed list of about 50,000 known chunks (called tokens). Common words like "the" are one chunk. Rare words like "tokenization" get split into a few chunks ("token" + "ization") because the model has those but not the whole word.

You can watch this happen live at tiktokenizer.vercel.app — paste anything.

That's all you need to know about tokenization. The interesting part starts now.

the library

Imagine your sentence is a small library, with one book per word on the shelf.

[the] [cat] [sat] [on] [the] [mat]

When you want to understand any single book — say, "sat" — you can't just look at it in isolation. The word "sat" could mean a hundred things (sat for an exam, sat in a chair, …). You need to understand it in the context of the other books on the shelf.

The library has a catalog. Every book has an entry in the catalog with two cards:

A title card (K) — what the book advertises itself as. "I'm an action verb. I'm about sitting."
A contents card (V) — what the book actually delivers if matched. "Action of sitting, past tense, requires a subject and a location."

And every book also has its own question (Q), which is what it needs to know to understand its place in this library:

"sat" asks: "What's the subject that's doing me? Where's it happening?"
"cat" asks: "What action am I taking? Where am I?"
"the" asks (a small question): "Which noun am I attached to?"

To answer each book's question, the model browses the catalog:

For "sat" specifically:

"cat"'s title card says "subject, noun, animal" → high match → pull in lots of "cat"'s contents card
"on"'s title card says "position word" → high match → pull in lots of "on"'s contents
"the"'s title card says "just a determiner" → low match → pull in almost nothing
"mat"'s title card says "noun being acted on" → medium match → pull in some of "mat"'s contents

The combined result — a blend of contents weighted by how well each title matched — is the new "sat". It's no longer the abstract verb "sat", it carries context: a sitting action done by a cat onto a mat. Every other book on the shelf gets the same treatment in parallel, each using its own question against everyone else's title cards.

Q asks. K announces. V delivers.

That's attention, the engine of every modern language model. The 2017 paper that built an entire architecture (the Transformer) around it is Attention Is All You Need: eight authors, eleven pages.

(In the actual model, the cards aren't paper — they're short lists of numbers. But the role each plays is exactly what the analogy says.)
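To make "short lists of numbers" concrete, here is a toy sketch of one book's browse through the catalog: score each title card against the question, turn the scores into weights with a softmax, and blend the contents cards. The two-number cards are invented for illustration, and real models use learned vectors with far more dimensions.

```python
import math


def attend(query, keys, values):
    """Blend the value vectors, weighted by how well each key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax turns raw match scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    blended = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended


# Toy 2-number cards, made up for this example.
books = ["cat", "on", "the", "mat"]
keys = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
sat_query = [1.0, 1.0]

weights, new_sat = attend(sat_query, keys, values)
for book, w in zip(books, weights):
    print(f"{book}: {w:.2f}")
```

With these made-up cards, "cat" and "on" get the largest weights, "mat" a medium one, and "the" almost nothing, which mirrors the walk-through above.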

doing it many times, in parallel

One catalog focuses on one type of relationship between books (maybe grammar — who's the subject of what verb). To capture different kinds of relationships — meaning, position, long-range references — the model maintains many parallel catalogs at once. Llama 3 8B has 32 of them.

Then it does the whole browsing-and-combining process again, this time using the previous round's enriched results. And again. Stacked 32 layers deep. Each layer refines the previous layer's understanding.

By layer 32, every book has a deeply layered understanding of its place in the library.

generating one word at a time

To write the next word, the model:

1. Runs all 32 layers of browsing and combining over the existing books.
2. Looks at the last book's final understanding.
3. Turns that into a probability over every word in the vocabulary.
4. Picks one — usually the most likely.
5. Adds that word's book to the end of the shelf.
6. Goes back to step 1, now with one more book on the shelf.

One rule when generating: each book can only consult catalog entries for books to its left. It can't peek at books that haven't been placed yet — those are what's being predicted. This means every book's catalog entry depends only on books to its left, never on anything to its right. Once a book's entry is in the catalog, it never changes.

That property is the opening for the optimization.

the KV cache

Here's the thing nobody tells you up front: the model doesn't remember anything between words. When it generates the next word, it doesn't pick up where it left off — it starts the whole sentence over from the beginning.

Every single word the model generates means going through every existing book again and re-generating every book's catalog entry. Just to add one new word at the end.

Imagine, every time a new book arrives at the shelf, throwing out the entire catalog and re-cataloging every existing book from scratch — including all 1,000 books that have been on the shelf for years.

Why does it work this way? Because each prediction is a self-contained calculation: "given the sentence so far, what comes next?" The "given the sentence so far" part is rebuilt every time. The model has no built-in memory between predictions.

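For a 3-word sentence, that means re-cataloging 3 books to get word #4. For a 1,000-word sentence, re-cataloging 1,000 books to get word #1,001. Generate 1,000 words this way and you've written about 500,000 catalog entries in total (1 + 2 + … + 1,000 = 500,500), so the cost grows with the square of the length.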

But notice: every existing book's catalog entry would come out identical every single time. Each entry depends only on books to its left, and nothing to its left has changed. So re-cataloging is pure wasted work.

The fix is obvious once you see it: keep the catalog. After each book's title and contents cards are generated the first time, save them. Next time we add a new word, only the brand-new book needs a fresh catalog entry. All the old ones are still on file.

That saved catalog is the KV cache.

Now each generation step only requires the brand-new book to write its catalog entry. The library doesn't get re-cataloged from scratch every time. Generation stays fast even as the sentence grows.

It's safe to cache because each catalog entry is frozen once written. A book's K and V depend only on books to its left. Nothing to its right (which is what's being added) can ever change them. So a cached entry can never go stale.
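A toy check of that claim, assuming a single layer and made-up stand-ins for the learned Q, K and V projections (make_q and make_kv are hypothetical). Both loops produce the same outputs; only the number of catalog entries written differs.

```python
import math


def make_q(token):
    # Hypothetical stand-in for the learned Q projection.
    return [math.cos(token), math.sin(token)]


def make_kv(token):
    # Hypothetical stand-in for the learned K and V projections.
    return [math.sin(token), math.cos(token)], [0.1 * token, 1.0]


def attend(query, entries):
    """Softmax-weighted blend of the V cards in entries, a list of (K, V) pairs."""
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in entries]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * value[i] for e, (_, value) in zip(exps, entries))
        for i in range(2)
    ]


def run_without_cache(tokens):
    outputs, entries_built = [], 0
    for step in range(1, len(tokens) + 1):
        catalog = [make_kv(t) for t in tokens[:step]]   # re-catalog every book
        entries_built += len(catalog)
        outputs.append(attend(make_q(tokens[step - 1]), catalog))
    return outputs, entries_built


def run_with_cache(tokens):
    outputs, entries_built, kv_cache = [], 0, []
    for token in tokens:
        kv_cache.append(make_kv(token))                 # only the new book
        entries_built += 1
        outputs.append(attend(make_q(token), kv_cache))
    return outputs, entries_built


tokens = [3, 1, 4, 1, 5, 9]
slow, slow_work = run_without_cache(tokens)
fast, fast_work = run_with_cache(tokens)
print(slow == fast, slow_work, fast_work)  # True 21 6
```

Six tokens cost 21 catalog entries without the cache (1 + 2 + ... + 6) and 6 with it. In a real multi-layer model, entries in deeper layers also depend on the books to the left, which is why the left-only rule is what makes caching safe.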

The only question is: how much memory does the catalog take?

the memory cost

This is where the inference bill lives.

There's a catalog entry per book, per layer, per parallel catalog. For a typical 7B model:

~32 layers
~32 parallel catalogs per layer
~128 numbers per card

With two cards (K and V) per entry and 2 bytes per number (fp16), that works out to 2 × 32 × 32 × 128 × 2 bytes ≈ 0.5 MB per book, per conversation. A 4,000-word conversation: ~2 GB of GPU memory per concurrent user, just for the catalog.

[Interactive calculator: KV cache size by model, sequence length, batch size, and dtype]

A few things worth playing with:

Switch from Llama 2 7B to Llama 3 8B. Total memory drops 4× — Llama 3 uses a catalog-shrinking trick (read on).
Bump seq length from 4k to 32k. The catalog grows linearly with the number of books on the shelf. This is why long-context models are expensive even when the model itself is unchanged.
Bump batch to 32 (serving 32 conversations at once). You pay 32× the memory. This is when the catalog starts to dominate GPU memory — not the model weights.
Switch dtype to int8. Catalog size halves. Tiny accuracy hit, big memory win.
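The same arithmetic as a small function, so the scenarios above can be reproduced. It treats one word as one token, as the rest of this post does, and rounds 4,000 up to 4,096. The model shapes are the published ones: 32 layers and 128 numbers per card for both, with 32 K/V heads for Llama 2 7B and 8 for Llama 3 8B. The function name is my own.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_number=2):
    """Size of the KV cache: one K and one V card per token, per layer, per KV head."""
    cards = 2  # K and V
    return cards * layers * kv_heads * head_dim * bytes_per_number * seq_len * batch


MIB = 1024 ** 2
GIB = 1024 ** 3
llama2_7b = dict(layers=32, kv_heads=32, head_dim=128)
llama3_8b = dict(layers=32, kv_heads=8, head_dim=128)   # GQA: 8 shared K/V heads

print(kv_cache_bytes(**llama2_7b, seq_len=1) / MIB)                         # 0.5
print(kv_cache_bytes(**llama2_7b, seq_len=4096) / GIB)                      # 2.0
print(kv_cache_bytes(**llama3_8b, seq_len=4096) / GIB)                      # 0.5
print(kv_cache_bytes(**llama2_7b, seq_len=4096, batch=32) / GIB)            # 64.0
print(kv_cache_bytes(**llama2_7b, seq_len=4096, bytes_per_number=1) / GIB)  # 1.0
```

Every factor is a plain multiplier, so doubling the sequence length or the batch doubles the memory, and halving the bytes per number halves it.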

why everyone's optimizing the catalog

Every modern inference innovation is some variation on shrinking the catalog:

GQA (Grouped-Query Attention) — instead of every question-asker having its own dedicated title and contents cards, groups of questions share one set of cards. Fewer entries to store. Used by Llama 2 70B, Llama 3, Mistral.
Sliding-window attention — only keep catalog entries for the last w books. Older books "leave the library." Bounded memory, less long-range memory.
Quantized KV cache — write the catalog entries in shorthand (int8 or int4 instead of fp16). Half or quarter the memory at modest quality cost.
Prefix caching — if many conversations start with the same intro ("You are a helpful assistant…"), share those catalog entries across conversations.
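Of the four, the sliding window is the easiest to picture in code. A minimal sketch, assuming one (K, V) pair per book and ignoring layers and heads; the class name is hypothetical.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps catalog entries for only the most recent `window` tokens."""

    def __init__(self, window):
        self.entries = deque(maxlen=window)

    def append(self, key, value):
        # When the deque is full, the oldest entry is dropped automatically.
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


cache = SlidingWindowKVCache(window=4)
for position in range(10):
    cache.append(f"K{position}", f"V{position}")

print(len(cache))                      # 4
print([k for k, _ in cache.entries])   # ['K6', 'K7', 'K8', 'K9']
```

Memory is capped at the window size no matter how long the conversation runs, and anything older than the window is simply gone.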

If you're running an inference service, the KV cache is your dominant resource. Every serving framework you've heard of — vLLM, TGI, TensorRT-LLM — is mostly a story about managing the catalog well.

one breath

Your sentence is a library, one book per word.
Every book has three things: a question (Q), a title card (K), and a contents card (V).
Attention = for every book, match its question against every other book's title card, pull in their contents weighted by how well the titles matched.
Many parallel catalogs (heads), repeated many layers — Llama 3 8B = 32 × 32.
When generating, new books get added to the shelf one at a time; each can only consult books already on the shelf to its left.
Existing books' catalog entries never change — so we save them in the catalog and reuse them across generation steps. That's the KV cache.
The catalog dominates inference memory. Shrinking it is most of what serving frameworks do.

If you want to go deeper later: Karpathy's Let's build GPT is a 2-hour video walkthrough that builds a transformer from scratch. Best next step.

— v

Retiring pull-request-code-coverage

Fri, 15 May 2026 00:00:00 GMT

Seven years ago, on a team trying to drag itself toward test-driven development, the principal engineer I worked with wrote a small library called pull-request-code-coverage. It's been around long enough. Time to retire it.

the problem we were trying to solve

Most teams that try to adopt TDD hit the same wall: the existing coverage is awful. If you measure the whole codebase, the number is depressingly low and stays that way for months, no matter how much new code you cover. People stop looking at the dashboard. The flywheel never spins up.

The trick was reframing what we measured. Instead of asking "what's the coverage of the codebase?" we asked "what's the coverage of this pull request?" Just the lines that changed. Just the work being shipped right now.

That meant a team with 8% global coverage could set a 90% bar on new code and watch things improve one PR at a time. No three-quarter heroics. No big-bang test sprints. Just a different denominator.
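The changed denominator fits in a few lines. This is a sketch of the idea only, not the library's actual implementation; the function name, the file name and the line numbers are invented.

```python
def pull_request_coverage(changed_lines, covered_lines):
    """Coverage over only the executable lines a pull request touched.

    Both arguments map a file path to a set of line numbers.
    """
    total = sum(len(lines) for lines in changed_lines.values())
    if total == 0:
        return 1.0  # assumption: a PR with no executable changes passes
    hit = sum(
        len(lines & covered_lines.get(path, set()))
        for path, lines in changed_lines.items()
    )
    return hit / total


changed = {"cart.py": set(range(10, 50))}        # 40 changed lines
covered = {"cart.py": set(range(1, 22))}         # lines 1-21 ran under test
print(pull_request_coverage(changed, covered))   # 0.3
```

Twelve of the forty changed lines are covered, so this PR scores 30% regardless of what the rest of the codebase looks like.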

why it worked

Two reasons, really:

It changed the conversation in code review. "You added 40 lines, 12 are covered" is a concrete, immediate ask. "Our project is at 23%" is no one's problem.
It met teams where they were. You didn't have to apologize for the legacy. The tool simply ignored it. You were rewarded for the next commit, not punished for the last decade.

We open-sourced it eventually. It got more use than I expected, which felt good.

why retire it now

Seven years is a long time in tooling. The same idea — diff-aware coverage — is now built into every major coverage tool, every CI provider, every code review platform. Codecov, SonarQube, Coveralls, GitHub's own checks all do it natively, with better integrations and a fraction of the setup our library needed.

And honestly: we couldn't keep up. The ecosystem moved fast. The original maintainers moved on to other work. Patches stalled, integrations lagged, the docs drifted. That's how most small open-source tools end — not with a decision, but with a slow loss of velocity.

When the better-supported alternatives exist and you can't give a tool the attention it deserves, the right move is to send people there. Maintaining a library because you wrote it isn't a reason. It's a habit.

what it taught me

Two things have stuck:

The most useful tools are reframings, not features. What changed our test culture wasn't the algorithm — it was the denominator. A small idea, well-aimed, did more than any process document ever did.
Open source is a temporary stewardship. You ship something, you support it while it's the best fit, and you let it go gracefully when it isn't. That isn't failure. That's the lifecycle working correctly.

To the people who used it, contributed, filed issues, sent patches — thank you. To the principal engineer who wrote that first commit: you were right.

The repo: github.com/target/pull-request-code-coverage

— v

Always be building

Thu, 07 May 2026 00:00:00 GMT

This is the only post on this site for now. It feels right that it's about why I still build with my hands — because the work has gotten more interesting since agents showed up, not less.

I'll be direct: agents are great. They scaffold projects, write tests, ship features, fix bugs, update docs. They get faster every month. The first time I watched one rebuild a service in an afternoon that would have cost me a weekend, I didn't feel anxious. I felt the same thing I felt the first time I used a really good profiler — I have more headroom now.

What I do with that headroom is build more, not less.

the work got bigger

When the prototype is cheap, the question worth asking is bigger. I used to spend most of my notebook entries figuring out whether an idea was worth a weekend. Now the weekend is half an evening, and the entries are about what would have to be true for this system to matter. The thinking moved up a level.

That's the actual gift. Not "look how fast I can ship." It's that the bar for what's worth shipping rises as the cost comes down.

taste is built, not preserved

I still write code by hand most days. Not because I'm protecting some skill from rusting — that framing always sounded a little defensive — but because writing is how I think. The friction of getting a thing to work is the friction of understanding it. Agents don't take that away; they let me have it on better questions.

When I review what an agent wrote, I bring opinions formed by my own builds that week. Those opinions are the part of the job that gets harder with leverage, not easier. They're worth investing in.

what I've changed

A few things look different than they did a year ago:

I write more code at the seams — review, glue, the parts where systems meet — and let agents do more of the centers.
I treat the agent like a strong collaborator: opinionated, fast, sometimes seeing what I missed. The pairing is real. I learn things.
I read more, because there's more time to. The bottleneck moved from producing code to having something true to say.
The notebook is bigger. Systems that used to stay theoretical now get prototyped. Most still don't ship — but I learn from a finished prototype what I never would have from an outline.

why "always"

The "always" isn't about hustle, and it isn't about holding a line. It's about staying close to the problem. Building keeps me close. So does writing here.

The point isn't to keep up with agents. It's to keep wanting things — wanting the system to exist, wanting it to be good. Wanting is the part nothing else does for you.

The cost of building dropped by an order of magnitude. The reason to do it didn't change. If anything, it's clearer now — building is the cheap, fast loop where I find out what I actually think.

— v