From judgment lists to trained Learning to Rank (LTR) models

Learn how to transform judgment lists into training data for Learning To Rank (LTR), design effective features, and interpret what your model learned.


In Evaluating search query relevance with judgment lists, we built judgment lists and used the _rank_eval API to measure search quality. While this approach gave us an objective way to evaluate changes, improving relevance still requires manual query tuning.

If judgment lists answer the question, “How good is my ranking?,” Learning To Rank (LTR) answers, “How do I systematically make it better?”

In this article, we take the next step: using those judgment lists to train an LTR model using XGBoost, Eland, and Elasticsearch. We’ll focus on understanding the process rather than on implementation details. For the complete code, refer to the companion notebook.

What is LTR?

LTR uses machine learning (ML) to build a ranking function for your search engine. Instead of manually tuning query weights, you provide examples of proper rankings (your judgment list) and let the model learn what makes documents relevant. In Elasticsearch, LTR works as a second-stage reranker on top of an initial retrieval:

  • First stage: A standard query (BM25, vector, or hybrid) retrieves candidate documents quickly.
  • Second stage: The LTR model reranks the top results using multiple signals it learned to combine.

For a deeper introduction, see Introducing Learning To Rank (LTR) in Elasticsearch.

The journey from judgment list to model

A judgment list tells us which documents should rank highly for a given query. But the model cannot learn directly from document IDs. It needs numerical signals that explain why certain documents are potentially relevant.

The process works like this:

  1. Start with judgments. Query-document pairs with relevance grades: for example, a judgment stating that doc1 is a good match for the query “DiCaprio performance”.
  2. Extract features. For each query-document pair, compute numerical signals, some about the document alone (for example, popularity), and others about how the query and document interact (for example, BM25 score).
  3. Train the model. The model learns which feature patterns predict high grades.
  4. Deploy. Deploy the trained model to your Elasticsearch cluster.
  5. Query. Use the model to rerank search results.

The key insight is that features must capture what your judgments are measuring. If your judgment list rewards popular thriller movies but your features only include text-matching scores, the model has no way to learn what makes those documents relevant.

What are features?

Features are numerical values that describe a query-document pair. In Elasticsearch, we define features using queries that return scores. There are three types:

  • Query-document features measure how well a query matches a document. Eland provides the QueryFeatureExtractor utility to define these features; each extractor runs a query and records its score, such as the BM25 relevance score, for the query-document pair:

This extracts the BM25 score from the title field for each document relative to the query.

  • Document features are properties of the document that don’t depend on the query. You can extract these using script_score or function_score:
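A sketch of one such document feature: a `script_score` query whose script simply returns the document's `popularity` field value as the score (the field name is an assumption from the movies dataset):

```python
# Sketch: a document feature whose score is the popularity field itself.
popularity_feature_query = {
    "script_score": {
        # Only score documents that actually have the field
        "query": {"exists": {"field": "popularity"}},
        "script": {"source": "return doc['popularity'].value;"},
    }
}
# Passed as the `query` of a QueryFeatureExtractor named "popularity",
# this yields a feature that is independent of the search terms.
```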
  • Query features describe the query itself, like the number of terms. These are less common but can help the model handle different query types.

Designing your feature set

Choosing features isn’t random. Each feature should capture a signal that might explain why users prefer certain documents. Let's look at the features from the LTR notebook and understand the reasoning:

| Feature | Type | Purpose |
| --- | --- | --- |
| `title_bm25` | Query-document | Title matches are strong relevance signals. For example, a movie titled Star Wars should rank highly for the query "star wars". |
| `actors_bm25` | Query-document | Some users search by actor name. If they search for "leonardo dicaprio movies", they should get films starring Leonardo DiCaprio. |
| `title_all_terms_bm25` | Query-document | A stricter version of title matching where all query terms must be present. It helps distinguish exact matches from partial ones. |
| `actors_all_terms_bm25` | Query-document | The same stricter matching logic, applied to actors. |
| `popularity` | Document | Users generally prefer well-known movies over obscure ones when relevance is similar. A popular Star Wars film should rank above a low-budget parody with "Star Wars" in the title. |

Notice the strategy here:

  • Multiple signals for the same concept. We have both title_bm25 (lenient) and title_all_terms_bm25 (strict). The lenient version scores any document where at least one query term matches the title, and the strict version requires all the terms to be present. For short queries, the lenient match might be enough; whereas for longer, more specific queries, strict matching might be more important. The model can learn when to rely on each.
  • Text features plus quality features. Text matching alone can return irrelevant documents that happen to contain the right words. The popularity feature lets the model boost well-known, quality content when text scores are similar.
  • Coverage for different query types. Some queries target titles ("star wars"), and others target actors ("dicaprio movies"). Having features for both means that the model can handle diverse searches.

When designing your own features, ask yourself, "What signals would a human use to decide if this document is relevant?" Those are your candidate features.

Building the training dataset

Once features are defined, we extract them for every query-document pair in our judgment list. The result is a training dataset where each row contains:

  • The query identifier.
  • The document identifier.
  • The relevance grade (from our judgment list).
  • All feature values.

Here’s a simplified example (the feature value columns are omitted):

| `query_id` | `query` | `doc_id` | `grade` |
| --- | --- | --- | --- |
| qid:1 | star wars | 11 | 4 |
| qid:1 | star wars | 12180 | 3 |
| qid:1 | star wars | 278427 | 1 |
| qid:2 | tom hanks movies | 857 | 4 |
| qid:2 | tom hanks movies | 13 | 3 |

A few things to notice:

NaN values are normal. When a query doesn’t match a field, the feature returns no score. The movie Star Wars has a high title_bm25 but no actors_bm25 because the query "star wars" doesn’t match any actor names.

Queries are grouped during training. The query_id column tells the model which documents to compare against each other. For "star wars", it learns that document 11 (grade 4) should rank above document 278427 (grade 1).

But here’s the important part: The model doesn’t memorize these specific queries. Instead, it learns general patterns, like "documents with high title_bm25 AND high popularity tend to have high grades." When presented with a new query, the model applies these learned patterns to rank the results.

Features must explain grade differences. Look at qid:1: The grade 4 document has a higher title_bm25 and higher popularity than the grade 1 document. These patterns are what the model learns.

Training the LTR model

With the training dataset prepared, we train an XGBoost model with a ranking objective. The model builds decision trees that learn patterns like:

  • "If title_bm25 > 10 and popularity > 50, predict high relevance."
  • "If title_bm25 is missing but actors_bm25 > 12, still predict moderate relevance."

Here's how the training process works in practice:

During training, the model tries different combinations of these rules and measures how well the resulting rankings match your judgment grades. It uses a metric called Normalized Discounted Cumulative Gain (NDCG) to score itself. A perfect NDCG of 1.0 means that the model's ranking exactly matches your judgments. Lower scores mean that some relevant documents are ranking below where they should be.
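To make the metric concrete, here is the NDCG arithmetic for a three-document ranking, using plain-grade gains and a log2 position discount (XGBoost's internal variant may weight gains differently):

```python
# Worked example: NDCG = DCG of our ranking / DCG of the ideal ranking.
import numpy as np

def dcg(grades):
    grades = np.asarray(grades, dtype=float)
    # Position discounts: log2(2), log2(3), log2(4), ...
    discounts = np.log2(np.arange(2, grades.size + 2))
    return float(np.sum(grades / discounts))

ranking = [3, 1, 4]                     # judgment grades in returned order
ideal = sorted(ranking, reverse=True)   # [4, 3, 1]: best possible order

ndcg = dcg(ranking) / dcg(ideal)        # about 0.881: the grade-4 doc ranks too low
```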

The training also uses a technique called early stopping. If the model's score stops improving for several rounds, training halts automatically. This prevents the model from memorizing the training data too closely, which would hurt its ability to generalize to new queries.

The companion notebook contains the complete training code.

Understanding what your LTR model learned

After training, XGBoost can show you which features the model relied on most. You can generate a feature importance chart using XGBoost's built-in visualization:

The importance_type="weight" parameter shows how often each feature was used in tree splits. Here’s the resulting chart:

The F score counts how many times each feature was used to make split decisions across all trees in the model. Higher values mean that the model relied on that feature more often.

In this example:

  • popularity (2178): The most important feature. The model frequently uses popularity to separate relevant from nonrelevant documents.
  • title_bm25 (1642): Second-most important. Title matches matter a lot for movie searches.
  • actors_bm25 (565): Moderately important. This is useful for queries that mention actors.
  • title_all_terms_bm25 (211): Occasionally useful. The stricter matching helps for some queries.
  • actors_all_terms_bm25 (63): Rarely used. The model found this feature less predictive.

This chart helps you iterate on your feature set. If a feature that you expected to be important shows near-zero importance, investigate why. Maybe the feature extraction is not working as intended, or maybe that signal doesn’t actually predict relevance in your judgment data.

Deploying and using the LTR model

Once trained, upload the model to Elasticsearch using Eland:
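A sketch of the upload step, assuming a connected Elasticsearch client `es_client`, the trained `ranker`, and the `ltr_config` feature definitions from earlier; the model id is arbitrary:

```python
# Sketch: serialize the XGBoost model plus its feature extractors into an
# Elasticsearch inference model (requires a reachable cluster).
from eland.ml import MLModel

MLModel.import_ltr_model(
    es_client=es_client,            # an elasticsearch.Elasticsearch instance
    model_id="ltr-model-xgboost",   # arbitrary id, referenced at query time
    model=ranker,                   # the trained XGBRanker
    ltr_model_config=ltr_config,    # the feature extractors defined earlier
    es_if_exists="replace",
)
```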

Once uploaded, the model can be used in a rescorer retriever, combined with other retrievers in a multistage search pipeline:
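A sketch of the request body, expressed as a Python dict; the index name, model id, and fields are assumptions consistent with the earlier examples:

```python
# Sketch: first-stage multi_match retrieval, top 50 reranked by the LTR model.
search_body = {
    "retriever": {
        "rescorer": {
            "rescore": {
                "window_size": 50,  # how many first-stage hits the model reranks
                "learning_to_rank": {
                    "model_id": "ltr-model-xgboost",   # the deployed model
                    "params": {"query": "star wars"},  # fills {{query}} in the feature extractors
                },
            },
            # First stage: plain BM25 over title and actors
            "retriever": {
                "standard": {
                    "query": {
                        "multi_match": {
                            "query": "star wars",
                            "fields": ["title", "actors"],
                        }
                    }
                }
            },
        }
    }
}
# es_client.search(index="movies", body=search_body)  # assumes a connected client
```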

Response (simplified):

The first-stage query retrieves candidates using BM25. The LTR model then reranks the top 50 results using all the features it learned to weight.

In this example, the multi_match query alone places some less relevant results in the top positions, which the LTR rescorer corrects:

Conclusion

The path from judgment lists to a working LTR model involves three key steps: designing features that capture relevance signals, building a training dataset that pairs those features with your judgment grades, and training a model that learns the patterns.

Our previous article becomes the starting point for this process. Your grades define what "relevant" means and how to measure it, and your features give the model the signals to predict it.

For the complete implementation with a dataset of 9,750 movies and 384,755 judgment rows, see the LTR notebook. For advanced use cases, like personalized search, see Personalized search with LTR.
