Reranker Model

Usage Examples

Rerank candidate memories by relevance with the in-house memos-reranker small model.

MemOS provides a memory reranking API built on the memos-reranker model series, which comes in a 0.6B lightweight version and a 4B enhanced version, post-trained from a qwen-reranker base model. Developers pass a user query and a list of candidate memories, and the API reranks the candidates by relevance in a single call.

For request/response fields and the OpenAPI definition, see Rerank Memory.
Authentication, base URL, and calling conventions are the same as in the MemOS Cloud Quick Start.

When to use memory reranking

The reranking API fits when you need:

  • Memory recall optimization: After retrieving a large set of candidate memories, rerank them to keep only those most relevant to the current query, which improves the quality of the injected context.
  • Low latency at high QPS: The 0.6B lightweight model suits latency-sensitive, frequently invoked workloads.
  • Flexible sorting control: The API accepts any custom list of candidate documents, so it works with any retrieval system and does not depend on the MemOS memory store.

How it works

The memory reranking API and interaction with the model are shown in the figure below:

The end-to-end flow of the reranking model is as follows:

  1. Query Input
    Developers pass in the user query (query) and the list of candidate memory documents (documents).
  2. Encoding & Representation
    The model encodes the query together with each candidate document.
  3. Relevance Scoring
    The model outputs a relevance score for each document. As shown in the figure, the scores fall into five levels; developers can set thresholds to suit their scenario.
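
The threshold step can be sketched as follows. This is a minimal illustration, assuming the scores have already been extracted from the response as one float per document, in input order. The function name `filter_by_threshold`, the 0.5 cut-off, and the example scores are hypothetical and not part of the MemOS API.

```python
from typing import List, Tuple


def filter_by_threshold(
    documents: List[str],
    scores: List[float],
    threshold: float = 0.5,  # hypothetical default; tune per scenario
) -> List[Tuple[str, float]]:
    """Keep documents scoring at or above the threshold, best first."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    kept = [(doc, s) for doc, s in zip(documents, scores) if s >= threshold]
    # Sort the survivors by score, highest first
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


docs = [
    "User prefers Jiangxiang-flavored baijiu, like Moutai.",
    "I don't drink alcohol.",
]
# Made-up scores for illustration only; real values come from the API
example_scores = [0.91, 0.12]
print(filter_by_threshold(docs, example_scores))
```

A stricter threshold trades recall for cleaner context, so it is worth choosing the value against real queries rather than fixing it up front.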

Get started

import os
import requests
import json

# Replace with your MemOS API Key
os.environ["MEMOS_API_KEY"] = "YOUR_API_KEY"
os.environ["MEMOS_BASE_URL"] = "https://memos.memtensor.cn/api/openmem/v1"

data = {
    # Available models: memos-reranker-0.6b (lightweight) or memos-reranker-4b (enhanced)
    "model": "memos-reranker-0.6b",
    "query": "Any liquor recommendations for me?",
    "documents": [
        "User prefers Jiangxiang-flavored baijiu, like Moutai.",
        "I don't drink alcohol."
    ]
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Token {os.environ['MEMOS_API_KEY']}"
}
url = f"{os.environ['MEMOS_BASE_URL']}/rerank"

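# Send the rerank request; the timeout avoids hanging on a stalled connection
res = requests.post(url=url, headers=headers, data=json.dumps(data), timeout=30)
# Raise an exception on HTTP error status instead of parsing an error body
res.raise_for_status()
print(f"result: {res.json()}")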
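
For repeated calls, the request construction above could be wrapped in a small helper. The sketch below only builds the URL, headers, and payload from the same environment variables and fields used in the example; it does not send anything and makes no assumption about the response shape. The name `build_rerank_request` is hypothetical.

```python
import os
from typing import Dict, List, Tuple

# The two model names listed in the example above
ALLOWED_MODELS = ("memos-reranker-0.6b", "memos-reranker-4b")


def build_rerank_request(
    query: str,
    documents: List[str],
    model: str = "memos-reranker-0.6b",
) -> Tuple[str, Dict[str, str], dict]:
    """Return (url, headers, payload) for a rerank call without sending it."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model}")
    base_url = os.environ["MEMOS_BASE_URL"].rstrip("/")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Token {os.environ['MEMOS_API_KEY']}",
    }
    payload = {"model": model, "query": query, "documents": documents}
    return f"{base_url}/rerank", headers, payload


# Placeholder values so the sketch runs standalone; replace with real settings
os.environ.setdefault("MEMOS_API_KEY", "YOUR_API_KEY")
os.environ.setdefault("MEMOS_BASE_URL", "https://memos.memtensor.cn/api/openmem/v1")

url, headers, payload = build_rerank_request(
    "Any liquor recommendations for me?",
    ["User prefers Jiangxiang-flavored baijiu, like Moutai.", "I don't drink alcohol."],
)
print(url)
```

The returned values can then be passed to `requests.post(url, headers=headers, json=payload, timeout=30)`.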

Limits

  • Maximum length of a single document: 32k tokens.
  • Maximum length of the query: 2k tokens.
  • Synchronous only for now: the API returns all results at once, after reranking completes.
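
A client-side pre-check can flag oversized inputs before a request is sent. The sketch below is an illustration under stated assumptions: the tokenizer the service uses is not specified here, so `count_tokens` is pluggable and the default is a crude one-token-per-character estimate; "32k" and "2k" are read conservatively as 32,000 and 2,000. All names are hypothetical.

```python
from typing import Callable, List

MAX_DOC_TOKENS = 32_000   # "32k" read conservatively; exact value is an assumption
MAX_QUERY_TOKENS = 2_000  # "2k" read conservatively; exact value is an assumption


def rough_token_count(text: str) -> int:
    """Crude stand-in: one token per character.

    This typically overestimates, so it errs on the side of flagging input.
    """
    return len(text)


def check_limits(
    query: str,
    documents: List[str],
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Return human-readable problems; an empty list means the input looks OK."""
    problems = []
    if count_tokens(query) > MAX_QUERY_TOKENS:
        problems.append("query exceeds the query token limit")
    for i, doc in enumerate(documents):
        if count_tokens(doc) > MAX_DOC_TOKENS:
            problems.append(f"document {i} exceeds the document token limit")
    return problems


print(check_limits("short query", ["short doc", "x" * 40_000]))
```

Swapping in a real tokenizer for `count_tokens` would make the check far less pessimistic for long documents.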

Compared to Embedding Retrieval

| Dimension | Reranking API | Embedding Retrieval |
| --- | --- | --- |
| Core behavior | Precision ranking of candidate docs, outputting relevance scores | Semantic similarity recall, fast coarse filtering |
| Storage | ❌ Does not write to the MemOS memory store | ❌ Does not write to the MemOS memory store |
| Model | 0.6B/4B reranking models | Embedding model |
| Precision | ✅ High (cross-encoding, query-doc interaction) | General (dual-tower encoding, independent representation) |
| Speed | Slower (requires pair-by-pair computation) | ✅ Fast (vector approximate retrieval) |
| Async | Not supported | Not supported |
| Typical use | Post-retrieval precision ranking / Memory quality assessment | Fast recall from massive memory store |
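
The comparison suggests a two-stage pattern: fast embedding recall to narrow the pool, then reranking for the final order. The sketch below illustrates that pattern with toy vectors and a pluggable `rerank_fn` that stands in for a call to the reranking API. The function names, the `recall_k`/`final_k` defaults, and the word-overlap stub scorer are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_then_rerank(
    query: str,
    query_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    rerank_fn: Callable[[str, List[str]], List[float]],
    recall_k: int = 50,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: coarse recall by embedding similarity
    by_similarity = sorted(
        corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )
    candidates = [text for text, _ in by_similarity[:recall_k]]
    # Stage 2: precise ordering of the shortlisted candidates
    scores = rerank_fn(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:final_k]


def stub_rerank(query: str, docs: List[str]) -> List[float]:
    # Placeholder scorer (word overlap); a real setup would call the rerank API
    q = set(query.lower().split())
    return [float(len(q & set(d.lower().split()))) for d in docs]


corpus = [
    ("red apple", [1.0, 0.0]),
    ("green apple pie", [0.9, 0.1]),
    ("blue car", [0.0, 1.0]),
]
print(recall_then_rerank("apple pie", [1.0, 0.0], corpus, stub_rerank, recall_k=2, final_k=1))
```

Keeping `recall_k` well above `final_k` gives the reranker room to promote items the embedding stage ranked low, at the cost of scoring more pairs.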