How Do We Use LLMs For Code Search?
Developers need fast, precise ways to find code that matches a task or concept. AI can make search feel like asking a teammate instead of guessing filenames or symbols.
What Are We Trying To Achieve?
- Ask questions in natural language and get relevant functions, classes, or patterns
- Search across languages and repositories
- Return snippets with file paths and line ranges
- Explain why a match is relevant and offer usage examples
System Overview
A practical AI code search system has three loops:
- Indexing loop – parse repos, split code into units, create embeddings, store vectors and metadata.
- Query loop – rewrite the user query, run hybrid retrieval (lexical + vector), re-rank, and return matches.
- Answer loop – feed the top results to an LLM for summarization, examples, and next-step guidance, with strict “don’t-make-stuff-up” prompting.
Indexing Pipeline (Concrete Steps)
- Repo intake
  - Pull source from main and active branches.
  - Respect `.gitignore` and exclude vendor, build, and minified assets.
- Code splitting
  - Prefer semantic chunks: one function or method per chunk; fall back to small text windows (e.g., 100–200 lines with overlap) when parsing fails.
  - Extract metadata per chunk: language, file path, symbol name, start/end lines, docstrings, imports.
- Static structure
  - Build a symbol table with references and callers.
  - Capture an AST digest (node types, identifiers) to aid structural matching.
- Embeddings
  - Use a code-aware embedding model.
  - Create vectors for each chunk and a separate vector for the docstring/comment-only view.
  - Normalize vectors; store them in a vector index such as FAISS or a similar ANN store.
- Lexical sidecar
  - Build a keyword/BM25 index (filenames, identifiers, comments).
  - Keep n‑gram and regex support for exact symbol or API lookups.
- Storage
  - Save `{id, repo, branch, path, lang, symbol, lines, vector, text, doc_vector, imports, callers}` per chunk (see the sketch after this list).
  - Reindex incrementally on commit using git hooks or CI.
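To make the loop concrete, here is a minimal sketch of function-level chunking, embedding, and vector indexing for Python files. It assumes `sentence-transformers` and `faiss` are installed; the embedding model name is a placeholder for whichever code-aware model you choose, and the metadata fields are a subset of the record above.

```python
# Minimal indexing sketch (illustrative): function-level chunks -> embeddings -> FAISS.
import ast
from pathlib import Path

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_python_file(path: Path):
    """Yield one chunk per function/method, with basic metadata."""
    source = path.read_text(encoding="utf-8", errors="ignore")
    tree = ast.parse(source)
    lines = source.splitlines()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            yield {
                "path": str(path),
                "lang": "python",
                "symbol": node.name,
                "lines": (node.lineno, node.end_lineno),
                "doc": ast.get_docstring(node) or "",
                "text": "\n".join(lines[node.lineno - 1 : node.end_lineno]),
            }

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder; pick a code-aware model

chunks = [c for p in Path("repo/").rglob("*.py") for c in chunk_python_file(p)]
vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

# Inner product over normalized vectors == cosine similarity.
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))
# Keep `chunks` (the metadata records) alongside the index, keyed by row id.
```

A real pipeline would wrap the `ast.parse` call in a try/except and fall back to fixed-size windows, as noted above, and would persist the records rather than keep them in memory.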
Query Pipeline (What Runs On Each Search)
- Query rewriting with an LLM
  - Expand the user query into:
    - synonyms/aliases (`dict merge`, `map update`)
    - language constraints (“Python only”)
    - structural hints (function, class, interface, test)
    - optional regex or API names if present
- Hybrid retrieval
  - Run lexical search and vector search in parallel (a minimal sketch follows this list).
  - Take the union of top‑k results from both (e.g., 200 total).
- Semantic re-ranking
  - Use a cross-encoder or an LLM “judge” prompt to score each `(query, snippet)` pair.
  - Add features to the score: path match, language match, recent edits, popularity, call graph proximity.
- Diversification
  - Apply maximal marginal relevance (MMR) so results are not near-duplicates.
- Result packaging
  - Return: snippet, file path, line range, why-it-matches note, quick usage example.
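Here is a hedged sketch of the retrieval and diversification steps. It reuses `model`, `index`, `chunks`, and `vectors` from the indexing sketch and adds a BM25 sidecar via `rank_bm25`; the fusion weights and the MMR trade-off are illustrative defaults, not tuned values.

```python
# Hybrid retrieval sketch (illustrative): BM25 + vector search, fused score, MMR diversification.
import numpy as np
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi([c["text"].lower().split() for c in chunks])

def hybrid_search(query: str, k: int = 200, final_k: int = 10, mmr_lambda: float = 0.7):
    # Lexical and vector retrieval; take the union of top-k candidates from both.
    lex_scores = np.asarray(bm25.get_scores(query.lower().split()))
    q = model.encode([query], normalize_embeddings=True).astype("float32")
    _, vec_ids = index.search(q, k)
    candidates = sorted(
        set(np.argsort(lex_scores)[-k:].tolist()) | {i for i in vec_ids[0].tolist() if i >= 0}
    )

    # Fused relevance: normalized BM25 plus cosine similarity (the 0.4/0.6 weights are a guess).
    lex = lex_scores[candidates] / (lex_scores.max() + 1e-9)
    cos = vectors[candidates] @ q[0]
    relevance = 0.4 * lex + 0.6 * cos

    # Maximal marginal relevance: penalize snippets too similar to ones already picked.
    picked, remaining = [], list(range(len(candidates)))
    while remaining and len(picked) < final_k:
        def mmr(i):
            redundancy = max((float(vectors[candidates[i]] @ vectors[candidates[j]]) for j in picked), default=0.0)
            return mmr_lambda * relevance[i] - (1 - mmr_lambda) * redundancy
        best = max(remaining, key=mmr)
        picked.append(best)
        remaining.remove(best)
    return [chunks[candidates[i]] for i in picked]
```

In a full system, the cross-encoder or LLM-judge re-ranking would run between the fused scoring and the MMR step, and the extra features listed above (path match, recency, call graph proximity) would be folded into `relevance` before diversification.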
Answer Loop (Turning Results Into Help)
Feed the top snippets and metadata to an LLM with strict instructions:
- Cite file paths and lines for every claim.
- Prefer code directly from the repo; do not invent APIs.
- If confidence is low, ask a clarifying question or show multiple candidates.
- Offer a minimal working example using only retrieved snippets.
RAG prompt sketch
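A minimal, illustrative version of that prompt (the wording and placeholders are assumptions, not a fixed template):

```text
You are a code search assistant. Answer ONLY from the snippets below.

Question: {user_query}

Snippets:
{for each result: path, line range, code}

Rules:
- Cite the file path and line range for every claim.
- Use only APIs that appear in the snippets; do not invent functions or arguments.
- If the snippets do not answer the question, say "not found" and ask one clarifying question.
- If helpful, show a minimal working example composed only of retrieved code.
```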
Practical Recipes
Natural-Language → Code
- “merge two dicts without mutation in Python”
  - Query rewrite adds: `copy`, `dict`, `update`, the `|` operator, `Mapping`
  - Structural filter: functions returning a new mapping
  - Results re-ranked with preference for pure functions and tests referencing them
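The rewritten query that the retriever actually runs might look roughly like this (the field names are illustrative assumptions):

```python
# Illustrative output of the query-rewrite step for the dict-merge example.
rewritten = {
    "original": "merge two dicts without mutation in Python",
    "aliases": ["dict merge", "map update", "copy", "update", "| operator", "Mapping"],
    "language": "python",
    "structure": ["function"],
    "regex": r"def \w*merge\w*\(",  # hypothetical exact-lookup hint
}
```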
Code → Code (reverse lookup)
- Paste a call site; ask for its definition or similar implementations.
- Embed the pasted code and run vector search to find near neighbors across languages.
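Reusing the index from the sketches above, code-to-code lookup is just query-by-example: the pasted snippet is embedded instead of a sentence (the call site below is hypothetical).

```python
# Code -> code lookup sketch: embed a pasted call site and find near neighbors.
pasted = "merged = merge_configs(base_cfg, override_cfg)"  # hypothetical call site

q = model.encode([pasted], normalize_embeddings=True).astype("float32")
_, ids = index.search(q, 5)
for i in ids[0]:
    if i >= 0:
        hit = chunks[i]
        print(hit["path"], hit["lines"], hit["symbol"])
```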
API-migration search
- Input: old API call; ask for code that uses a replacement API.
- Lexical side finds call sites; vector side surfaces adapter functions and tests.
- LLM generates a patch sketch referencing actual files/lines.
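One hedged way to wire this up with the same components: a regex on the lexical side enumerates old call sites, while a natural-language vector query surfaces the replacement (the specific APIs here are hypothetical examples).

```python
# API-migration sketch: regex finds old call sites, vector search finds replacement code.
import re

old_call = re.compile(r"requests\.get\(")  # hypothetical old API
call_sites = [c for c in chunks if old_call.search(c["text"])]

q = model.encode(["async HTTP GET helper using httpx"], normalize_embeddings=True).astype("float32")
_, ids = index.search(q, 5)
replacements = [chunks[i] for i in ids[0] if i >= 0]

# Hand both lists to the LLM so the patch sketch cites real files and line ranges.
```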
Snippet Scoring Heuristics That Help
- Path prior: prefer `src/` over `examples/`, and prefer non‑deprecated directories.
- Freshness: recently touched code gets a small boost.
- Test linkage: snippets referenced in tests rank higher.
- Comment density: helpful docstrings increase the score slightly.
- Call graph: functions widely referenced near the query context climb the list.
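These heuristics compose naturally into a small re-scoring function layered on top of the re-ranker output. The weights below are illustrative starting points, and the extra metadata fields (`last_modified`, `referenced_by_tests`, `comment_density`, `caller_count`) are assumed additions to the stored chunk record.

```python
# Heuristic re-scoring sketch: small, additive boosts on top of the semantic relevance score.
import time

def heuristic_boost(chunk: dict, now: float | None = None) -> float:
    now = now or time.time()
    boost = 0.0
    path = chunk["path"]
    if path.startswith("src/"):
        boost += 0.10                                            # path prior
    if "examples/" in path or "deprecated" in path:
        boost -= 0.10
    if now - chunk.get("last_modified", 0) < 30 * 24 * 3600:
        boost += 0.05                                            # freshness
    if chunk.get("referenced_by_tests"):
        boost += 0.10                                            # test linkage
    boost += min(chunk.get("comment_density", 0.0), 0.3) * 0.1   # docstrings help a little
    boost += min(chunk.get("caller_count", 0), 20) * 0.005       # call-graph popularity
    return boost

# final_score = rerank_score + heuristic_boost(chunk)
```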
Prompt Patterns You Can Reuse
Query rewrite
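An illustrative version (the phrasing and JSON keys are assumptions):

```text
Rewrite the developer's code-search query for retrieval.

Query: {user_query}

Return JSON with:
- "aliases": synonyms and related identifiers/APIs
- "language": a language filter if the query implies one, else null
- "structure": any of ["function", "class", "interface", "test"]
- "regex": an exact-match pattern if an API or symbol name is given, else null

Do not add concepts that are not implied by the query.
```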
Judge (re-ranker)
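An illustrative version (the scale and output format are assumptions):

```text
You are ranking code snippets for the query: {user_query}

For each snippet (path, lines, code), return a score from 0 to 3:
3 = directly implements or answers the query
2 = closely related (caller, test, or near-duplicate of the answer)
1 = same topic but not usable as an answer
0 = unrelated

Judge only from the snippet text; do not assume code that is not shown.
Return JSON: [{"id": ..., "score": ..., "reason": "<one short sentence>"}]
```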
Evaluating Quality
- Top‑k recall: does the correct file appear in the top 10/50?
- MRR / nDCG: ranking quality for ground‑truth query→file pairs.
- Time‑to‑first‑useful‑click: user-centric metric.
- Abstention rate: frequency of honest “not found” responses.
- Error audits: sample failures where the system pointed to wrong code.
Create a small gold set: real tickets, PR review questions, and onboarding tasks; map each to the expected files/lines.
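A tiny evaluation harness over such a gold set might compute recall@k and MRR like this (the gold-set format is an assumption):

```python
# Evaluation sketch: recall@k and MRR over a gold set of (query, expected file) pairs.
def evaluate(gold: list[dict], search_fn, k: int = 10):
    """gold items look like {"query": ..., "expected_path": ...} (assumed format)."""
    hits, reciprocal_ranks = 0, []
    for item in gold:
        results = search_fn(item["query"])[:k]          # e.g., hybrid_search from above
        paths = [r["path"] for r in results]
        if item["expected_path"] in paths:
            hits += 1
            reciprocal_ranks.append(1.0 / (paths.index(item["expected_path"]) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return {
        f"recall@{k}": hits / len(gold),
        "mrr": sum(reciprocal_ranks) / len(gold),
    }
```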
Privacy And Security
- Keep embedding and indexing on infrastructure you control for private repos.
- Strip secrets and large binaries from the index.
- Respect license boundaries when mixing public and private code.
- Log queries and clicks with redaction; never store raw prompts that include credentials.
Common Pitfalls
- Chunks too large, hiding the target function in noise.
- Index drift from stale branches; run scheduled refreshes.
- Re-ranker absent or weak, causing noisy top results.
- Prompts that allow hallucinated APIs; add strict rules and abstain logic.
- Ignoring structural signals (AST, call graph, tests) that could break ties.
Quick Start Checklist
- Parse repos and split into function-level chunks.
- Build embeddings and a FAISS index; build a BM25 index too.
- Add metadata: path, language, symbol, lines, tests, imports.
- Implement LLM query rewrite and hybrid retrieval.
- Re-rank with a cross-encoder or LLM judge; diversify results.
- Wrap the top snippets in a RAG prompt with strict “no invention” rules.
- Track top‑k recall, MRR, time‑to‑first‑click; iterate on chunking and prompts.
- Add CI hooks for incremental indexing and stale-index alerts.
AI code search shines when it blends structure (AST, symbols), statistics (embeddings), and dialogue (prompts that reward honesty). Start small on one repo, tune chunking and re-ranking, then scale to the rest of your codebase.