
Understanding the BM25 Formula: A Practical Guide to Modern Information Retrieval

BM25 is one of the most widely used algorithms for ranking search results. It determines how relevant a document is to a query by analyzing term frequency, term rarity, and document length. Despite being developed decades ago, BM25 remains a foundation of modern search systems.

When you type a query into a search engine—whether it’s Google, a digital library, or an internal enterprise search system—the system must decide which documents are most relevant to your query. One of the most influential algorithms used for this ranking task is BM25, short for Best Matching 25. Despite its somewhat cryptic name, BM25 is grounded in intuitive principles about how words relate to relevance.

This article explains what BM25 is, where it comes from, how the formula works, and why it remains so important in modern search systems.

1. The Origins of BM25

BM25 was developed in the 1990s by Stephen Robertson and his colleagues as part of the Okapi information retrieval system at City University, London. It belongs to a family of ranking functions derived from the probabilistic model of information retrieval.

The core idea behind probabilistic retrieval is simple:

A document is relevant if it has a high probability of satisfying the user’s query.

BM25 estimates this probability using term frequency, document length, and inverse document frequency. Over time, it proved so effective that it became the default ranking function in systems like Apache Lucene and Elasticsearch.

2. The BM25 Formula

The BM25 scoring function for a document $D$ and query $Q$ is:

$$ \text{score}(D, Q) = \sum_{t \in Q} IDF(t) \cdot \frac{f(t, D)\cdot (k_1 + 1)} {f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} $$
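
Transcribed into code, the whole formula fits in a few lines. The sketch below is a minimal Python rendering under some stated assumptions: documents arrive pre-tokenized, the caller supplies the collection statistics ($N$, avgdl, and per-term document frequencies), and names such as `bm25_score` and `doc_freqs` are illustrative rather than taken from any particular library.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, doc_freqs, N, avgdl,
               k1=1.2, b=0.75):
    """Score one pre-tokenized document against a query with BM25.

    doc_freqs maps each term t to n_t (the number of documents
    containing t); N is the collection size and avgdl the average
    document length in tokens.
    """
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        n_t = doc_freqs.get(t, 0)
        if n_t == 0 or tf[t] == 0:
            continue  # a term absent from the document contributes nothing
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5))
        numerator = tf[t] * (k1 + 1)
        denominator = tf[t] + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * numerator / denominator
    return score
```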

Let’s break this down step by step.

3. Key Components of BM25

3.1 Term Frequency $f(t, D)$

This measures how many times term $t$ appears in document $D$.

Intuition:

  • If a word appears frequently in a document, that document is more likely to be relevant.
  • But repetition has diminishing returns—after a certain point, seeing the word more times doesn’t help much.

BM25 handles this using a saturation function controlled by the parameter $k_1$.
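
To see the saturation in action, the short sketch below evaluates just the term-frequency part of the formula, with the length factor held at 1 (i.e., a document of exactly average length); the numbers are purely illustrative.

```python
def tf_component(f, k1=1.2):
    # BM25's TF term with the length factor fixed at 1 (average-length doc)
    return f * (k1 + 1) / (f + k1)

for f in (1, 2, 3, 5, 10, 50):
    print(f"f = {f:2d} -> {tf_component(f):.3f}")
# 1.000, 1.375, 1.571, 1.774, 1.964, 2.148:
# the weight climbs quickly, then flattens toward its cap of k1 + 1 = 2.2.
```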

3.2 Inverse Document Frequency $IDF$

IDF measures how rare a word is across the document collection:

$$ IDF(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5} $$

Where:

  • $N$ = total number of documents
  • $n_t$ = number of documents containing term $t$

Intuition:

  • Rare terms are more informative.
  • Common words like “the” or “is” provide little relevance signal.

IDF increases the weight of rare words and decreases the weight of common ones.
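
A quick sketch makes the effect concrete; the collection size and document counts below are made up.

```python
import math

def idf(N, n_t):
    # BM25 IDF with the standard 0.5 smoothing
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

N = 1_000_000                 # hypothetical collection size
print(idf(N, 100))            # rare term:        ~9.21
print(idf(N, 500_000))        # very common term: ~0.00
```

Note that for a term appearing in more than half the documents this raw formula goes negative, which is why practical implementations (Lucene, for example) use a variant that stays non-negative.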

3.3 Document Length Normalization

BM25 compares each document's length with the average length across the collection:

$$ \frac{|D|}{\text{avgdl}} $$

Where:

  • $|D|$ = length of document $D$
  • $\text{avgdl}$ = average document length in the collection

Longer documents naturally contain more words, so they might match query terms more often just by chance. BM25 corrects for this using the parameter $b$.
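
The sketch below isolates the length factor from the formula's denominator; the lengths are made up. Because the factor sits in the denominator, values below 1 boost a document's terms and values above 1 dampen them.

```python
def length_factor(doc_len, avgdl, b=0.75):
    # the (1 - b + b * |D| / avgdl) piece of the BM25 denominator
    return 1 - b + b * doc_len / avgdl

avgdl = 500                        # hypothetical average length
print(length_factor(100, avgdl))   # short doc: 0.40 -> its terms count for more
print(length_factor(2000, avgdl))  # long doc:  3.25 -> its terms count for less
```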

3.4 Parameters $k_1$ and $b$

BM25 has two tunable parameters:

  • $k_1$ (typically between 1.2 and 2.0)

    • Controls how quickly term frequency saturates.
    • Higher values make frequency more influential.
  • $b$ (between 0 and 1, usually 0.75)

    • Controls document length normalization.
    • $b = 1$: full normalization
    • $b = 0$: no length normalization

These parameters allow BM25 to adapt to different types of corpora.
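
To get a feel for the two knobs, the sketch below scores a single term (frequency 5) in a document twice the average length under a few parameter settings; all values are illustrative.

```python
def tf_term(f, doc_len, avgdl, k1, b):
    # full TF component: saturation plus length normalization
    return f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))

f, doc_len, avgdl = 5, 1000, 500   # hypothetical values
for k1 in (1.2, 2.0):
    for b in (0.0, 0.75, 1.0):
        print(f"k1={k1}, b={b:<4} -> {tf_term(f, doc_len, avgdl, k1, b):.3f}")
# Raising b penalizes this longer-than-average document more;
# raising k1 lets the frequency of 5 count for more before saturating.
```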

4. Why BM25 Works So Well

BM25 is powerful because it balances three intuitive signals:

  1. Term importance (IDF): rare terms matter more.

  2. Term frequency (TF saturation): more occurrences increase relevance, but with diminishing returns.

  3. Document length normalization: prevents long documents from unfairly dominating.

Unlike simpler models such as TF-IDF, BM25 handles term frequency in a non-linear way. This prevents extreme term repetition from overwhelming the ranking.

5. A Simple Intuitive Example

Imagine a query:

machine learning

Suppose we compare two documents:

  • Document A: Short article mentioning “machine learning” 3 times.
  • Document B: Very long article mentioning it 3 times.

Even though both mention the term equally often:

  • Document A should probably rank higher.
  • Document B might just contain the phrase incidentally.

BM25 adjusts for document length, so shorter, focused documents often score higher.

Now imagine:

  • Document C mentions “machine learning” 50 times.

TF-IDF might rank it extremely high. BM25 reduces the impact of excessive repetition through its saturation mechanism.
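
Running the three documents through the `bm25_score` sketch from Section 2 makes this concrete. All collection statistics below ($N$, avgdl, and the document frequencies) are invented for illustration.

```python
# Hypothetical documents for the query "machine learning"
docs = {
    "A": ["machine", "learning"] * 3 + ["intro"] * 4,     # 10 tokens, focused
    "B": ["machine", "learning"] * 3 + ["filler"] * 994,  # 1000 tokens, diluted
    "C": ["machine", "learning"] * 50 + ["notes"] * 100,  # 200 tokens, repetitive
}
# Invented collection statistics: pretend these documents sit inside a
# corpus of 1,000 documents with an average length of 500 tokens.
N, avgdl = 1000, 500
doc_freqs = {"machine": 50, "learning": 80}

for name, tokens in docs.items():
    s = bm25_score(["machine", "learning"], tokens, doc_freqs, N, avgdl)
    print(f"{name}: {s:.2f}")
# A ~10.7, B ~6.9, C ~11.7: A beats the long document B outright, and
# C's 50 repetitions buy it only a slim lead over A's 3 (saturation).
```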

6. BM25 vs. TF-IDF

| Feature | TF-IDF | BM25 |
| --- | --- | --- |
| Term frequency | Linear | Saturating (non-linear) |
| Length normalization | Basic | Tunable via $b$ |
| Theoretical basis | Heuristic | Probabilistic model |
| Performance | Good | Often superior |

BM25 is essentially a refined and theoretically grounded version of TF-IDF.

7. BM25 in Modern Search and AI

Even in the era of neural search and embeddings, BM25 remains highly relevant:

  • Used as the default ranking function in Apache Lucene and Elasticsearch
  • Forms the lexical component in hybrid search systems
  • Combined with vector similarity in Retrieval-Augmented Generation (RAG)
  • Serves as a strong baseline in research

Interestingly, many modern AI search systems combine BM25 (symbolic, lexical search) with dense vector embeddings (semantic search) to get the best of both worlds.
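
One common fusion technique is reciprocal rank fusion (RRF), which merges the two ranked lists without needing to put BM25 scores and cosine similarities on the same scale. The sketch below is a minimal illustration; the document IDs and rankings are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Combine ranked lists of doc IDs; k=60 is the commonly used constant.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]    # hypothetical lexical ranking
vector_top = ["d1", "d9", "d3"]  # hypothetical semantic ranking
print(reciprocal_rank_fusion([bm25_top, vector_top]))
# ['d1', 'd3', 'd9', 'd7'] -- documents found by both retrievers rise to the top
```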

8. Strengths and Limitations

Strengths

  • Simple and computationally efficient
  • Interpretable
  • Strong empirical performance
  • Works well on large corpora

Limitations

  • Cannot capture semantic similarity (e.g., “car” vs “automobile”)
  • Relies on exact term matching
  • Parameters require tuning

This is why BM25 is often paired with embedding-based methods in modern systems.

9. Why It’s Still Important

BM25 is one of the most influential ranking algorithms ever developed. Even decades after its introduction, it remains:

  • A production-standard ranking function
  • A research baseline
  • A core component in hybrid search architectures

BM25 is not just a formula—it’s a carefully designed balance between term rarity, term frequency, and document length. Its strength lies in its simplicity and probabilistic grounding.

While neural models and embeddings are advancing rapidly, BM25 continues to serve as the backbone of many search systems. For anyone working in information retrieval, search engineering, or AI-powered knowledge systems, understanding BM25 is essential.

It represents one of the clearest examples of how mathematical modeling can transform raw text into ranked relevance.
