# LLMs and Embeddings

## Overview

MemOS decouples model logic from runtime config via two Pydantic factories:
| Factory | Produces | Typical backends |
|---|---|---|
| `LLMFactory` | Chat-completion model | `ollama`, `openai`, `qwen`, `deepseek`, `huggingface` |
| `EmbedderFactory` | Text-to-vector encoder | `ollama`, `sentence_transformer`, `universal_api` |
Both factories accept a config blob validated via the matching `*ConfigFactory.model_validate(...)`, so you can switch providers with a single `backend=` swap.
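The decoupling above can be sketched as a minimal backend registry. This is purely illustrative (the class and function names below are invented for the sketch; the real MemOS factories are Pydantic-based and support more backends), but it shows why swapping `backend` is enough to change providers:

```python
# Minimal sketch of the backend-registry pattern behind the factories.
# Class names here are illustrative, not the actual MemOS internals.

class OllamaLLM:
    def __init__(self, config):
        self.model = config["model_name_or_path"]

class OpenAILLM:
    def __init__(self, config):
        self.model = config["model_name_or_path"]

BACKENDS = {"ollama": OllamaLLM, "openai": OpenAILLM}

def from_config(blob):
    # Switching provider is a single "backend" swap in the blob.
    return BACKENDS[blob["backend"]](blob["config"])

llm = from_config({
    "backend": "ollama",
    "config": {"model_name_or_path": "qwen3:0.6b"},
})
```

Changing `"backend": "ollama"` to `"backend": "openai"` selects a different class while the calling code stays untouched.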
## LLM Module

### Supported LLM Backends
| Backend | Notes | Example Model ID |
|---|---|---|
| `ollama` | Local llama.cpp runner | `qwen3:0.6b`, etc. |
| `openai` | Official or proxy API | `gpt-4o-mini`, `gpt-3.5-turbo`, etc. |
| `qwen` | DashScope-compatible | `qwen-plus`, `qwen-max-2025-01-25`, etc. |
| `deepseek` | DeepSeek REST API | `deepseek-chat`, `deepseek-reasoner`, etc. |
| `huggingface` | Transformers pipeline | `Qwen/Qwen3-1.7B`, etc. |
### LLM Config Schema

Common fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `model_name_or_path` | str | – | Model ID or local tag |
| `temperature` | float | 0.8 | Sampling temperature |
| `max_tokens` | int | 1024 | Maximum tokens to generate |
| `top_p` / `top_k` | float / int | 0.9 / 50 | Nucleus / top-k sampling cutoffs |
| `api_key`, `api_base` | str | – | OpenAI-compatible credentials (API backends only) |
| `remove_think_prefix` | bool | True | Strip `/think` role content |
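Putting the common fields together, a typical OpenAI-style config blob might look like the following (the `api_key` value is a placeholder, not a real credential):

```python
# Example config blob combining the common fields above.
# "sk-..." is a placeholder; supply your own key.
openai_cfg = {
    "backend": "openai",
    "config": {
        "model_name_or_path": "gpt-4o-mini",
        "temperature": 0.8,
        "max_tokens": 1024,
        "top_p": 0.9,
        "api_key": "sk-...",
        "api_base": "https://api.openai.com/v1",
    },
}
```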
### Factory Usage

```python
from memos.configs.llm import LLMConfigFactory
from memos.llms.factory import LLMFactory

cfg = LLMConfigFactory.model_validate({
    "backend": "ollama",
    "config": {"model_name_or_path": "qwen3:0.6b"},
})
llm = LLMFactory.from_config(cfg)
```
### LLM Core APIs

| Method | Purpose |
|---|---|
| `generate(messages: list)` | Return the full string response |
| `generate_stream(messages)` | Yield streaming chunks |
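The relationship between the two calls can be sketched with a stub that follows the same interface (a toy stand-in, not the MemOS implementation): `generate` returns the complete string, while `generate_stream` yields the same content chunk by chunk.

```python
# Illustrative stub mirroring the documented interface:
# generate() returns the full reply; generate_stream() yields chunks.
class StubLLM:
    def generate(self, messages: list) -> str:
        return "".join(self.generate_stream(messages))

    def generate_stream(self, messages: list):
        # A real backend would stream tokens from the model here.
        for chunk in ["Hello", ", ", "world"]:
            yield chunk

stub = StubLLM()
full = stub.generate([{"role": "user", "content": "hi"}])
# full == "Hello, world"
```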
### Streaming & CoT

```python
messages = [{"role": "user", "content": "Let's think step by step: …"}]
for chunk in llm.generate_stream(messages):
    print(chunk, end="")
```
Find all scenarios in `examples/basic_modules/llm.py`.

### Performance Tips

- Use `qwen3:0.6b` for a <2 GB footprint when prototyping locally.
- Combine with KV Cache (see the KVCacheMemory doc) to cut TTFT (time to first token).
## Embedding Module

### Supported Embedder Backends

| Backend | Example Model | Vector Dim |
|---|---|---|
| `ollama` | `nomic-embed-text:latest` | 768 |
| `sentence_transformer` | `nomic-ai/nomic-embed-text-v1.5` | 768 |
| `universal_api` | `text-embedding-3-large` | 3072 |
### Embedder Config Schema

Shared keys: `model_name_or_path`, plus optional API credentials (`api_key`, `base_url`) for API backends.
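For instance, a `universal_api` blob layers the API credentials on top of the shared keys (the `api_key` value below is a placeholder):

```python
# universal_api embedder config blob; "sk-..." is a placeholder key.
api_embedder_cfg = {
    "backend": "universal_api",
    "config": {
        "model_name_or_path": "text-embedding-3-large",
        "api_key": "sk-...",
        "base_url": "https://api.openai.com/v1",
    },
}
```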
### Factory Usage

```python
from memos.configs.embedder import EmbedderConfigFactory
from memos.embedders.factory import EmbedderFactory

cfg = EmbedderConfigFactory.model_validate({
    "backend": "ollama",
    "config": {"model_name_or_path": "nomic-embed-text:latest"},
})
embedder = EmbedderFactory.from_config(cfg)
```
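Once built, the embedder maps text to fixed-size vectors (768-dim for the Ollama backend above). A common downstream step is comparing vectors by cosine similarity; the sketch below uses tiny hand-written vectors in place of real embedder output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-ins for embedder output (real vectors would be 768-dim).
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]

sim_same = cosine_similarity(v1, v2)   # identical vectors -> 1.0
sim_diff = cosine_similarity(v1, v3)   # orthogonal vectors -> 0.0
```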
## MemScheduler

MemScheduler is a concurrent memory-management system that runs in parallel with MemOS, coordinating memory operations across working memory, long-term memory, and activation memory. It handles memory retrieval, updates, and compaction through event-driven scheduling, making it particularly well suited to conversational agents and reasoning systems that require dynamic memory management.
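The event-driven idea can be sketched with a plain queue of memory events and per-type handlers. This is an illustrative toy, not the MemScheduler API (which runs concurrently and manages real memory tiers):

```python
from collections import deque

# Toy event-driven scheduler sketch (not the MemScheduler API).
class TinyScheduler:
    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def on(self, event_type, handler):
        # Register a handler for one event type, e.g. "retrieve" or "update".
        self.handlers[event_type] = handler

    def submit(self, event_type, payload):
        # Producers enqueue events; handling is deferred.
        self.queue.append((event_type, payload))

    def run(self):
        # Drain the queue, dispatching each event to its handler.
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.handlers[event_type](payload)

working_memory = []
sched = TinyScheduler()
sched.on("retrieve", lambda query: working_memory.append(f"hit:{query}"))
sched.submit("retrieve", "user preferences")
sched.run()
```

Decoupling submission from handling is what lets a real scheduler batch, reorder, or compact memory operations without blocking the conversation loop.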
## KV Cache Memory

KVCacheMemory is a specialized MemOS memory module for storing and managing key-value (KV) caches, primarily used to accelerate large language model (LLM) inference and to support efficient context reuse. It is especially useful as activation memory in conversational and generative AI systems.
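The benefit of context reuse can be illustrated with a toy prefix cache: when two requests share a prompt prefix, the expensive work for that prefix is done once and reused. This is a pure-Python sketch, not KVCacheMemory itself; a real KV cache stores per-layer attention key/value tensors.

```python
# Toy prefix cache illustrating KV-cache-style context reuse.
# A real KV cache holds per-layer attention key/value tensors.
class ToyPrefixCache:
    def __init__(self):
        self.cache = {}
        self.computed = 0  # counts "expensive" prefix computations

    def encode_prefix(self, prefix: str):
        if prefix not in self.cache:
            self.computed += 1                   # expensive step, done once
            self.cache[prefix] = prefix.upper()  # stand-in for cached state
        return self.cache[prefix]

cache = ToyPrefixCache()
cache.encode_prefix("You are a helpful assistant.")
cache.encode_prefix("You are a helpful assistant.")  # cache hit, no recompute
# cache.computed == 1
```

Skipping the recomputation of a shared prefix is exactly what cuts TTFT in the performance tip above.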