# LLMs and Embeddings

## Overview

MemOS decouples model logic from runtime config via two Pydantic factories:
| Factory | Produces | Typical backends |
|---|---|---|
| `LLMFactory` | Chat-completion model | `ollama`, `openai`, `qwen`, `deepseek`, `huggingface` |
| `EmbedderFactory` | Text-to-vector encoder | `ollama`, `sentence_transformer`, `universal_api` |
Both factories accept a config blob validated via the matching `*ConfigFactory.model_validate(...)`, so you can switch providers with a single `backend=` swap.
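The decoupling above can be sketched as a minimal backend registry. This is purely illustrative (the class and function names below are invented for the sketch; the real MemOS factories are Pydantic-based and support more backends), but it shows why swapping `backend` is enough to change providers:

```python
# Minimal sketch of the backend-registry pattern behind the factories.
# Class names here are illustrative, not the actual MemOS internals.

class OllamaLLM:
    def __init__(self, config):
        self.model = config["model_name_or_path"]

class OpenAILLM:
    def __init__(self, config):
        self.model = config["model_name_or_path"]

BACKENDS = {"ollama": OllamaLLM, "openai": OpenAILLM}

def from_config(blob):
    # Switching provider is a single "backend" swap in the blob.
    return BACKENDS[blob["backend"]](blob["config"])

llm = from_config({
    "backend": "ollama",
    "config": {"model_name_or_path": "qwen3:0.6b"},
})
```

Changing `"backend": "ollama"` to `"backend": "openai"` selects a different class while the calling code stays untouched.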
## LLM Module

### Supported LLM Backends
| Backend | Notes | Example Model ID |
|---|---|---|
| `ollama` | Local llama.cpp runner | `qwen3:0.6b`, etc. |
| `openai` | Official or proxy API | `gpt-4o-mini`, `gpt-3.5-turbo`, etc. |
| `qwen` | DashScope-compatible | `qwen-plus`, `qwen-max-2025-01-25`, etc. |
| `deepseek` | DeepSeek REST API | `deepseek-chat`, `deepseek-reasoner`, etc. |
| `huggingface` | Transformers pipeline | `Qwen/Qwen3-1.7B`, etc. |
### LLM Config Schema

Common fields:

| Field | Type | Default | Description |
|---|---|---|---|
| `model_name_or_path` | str | – | Model ID or local tag |
| `temperature` | float | 0.8 | Sampling temperature |
| `max_tokens` | int | 1024 | Maximum tokens to generate |
| `top_p` / `top_k` | float / int | 0.9 / 50 | Nucleus / top-k sampling cutoffs |
| `api_key`, `api_base` | str | – | OpenAI-compatible credentials (API backends only) |
| `remove_think_prefix` | bool | True | Strip `/think` role content |
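Putting the common fields together, a typical OpenAI-style config blob might look like the following (the `api_key` value is a placeholder, not a real credential):

```python
# Example config blob combining the common fields above.
# "sk-..." is a placeholder; supply your own key.
openai_cfg = {
    "backend": "openai",
    "config": {
        "model_name_or_path": "gpt-4o-mini",
        "temperature": 0.8,
        "max_tokens": 1024,
        "top_p": 0.9,
        "api_key": "sk-...",
        "api_base": "https://api.openai.com/v1",
    },
}
```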
### Factory Usage

```python
from memos.configs.llm import LLMConfigFactory
from memos.llms.factory import LLMFactory

cfg = LLMConfigFactory.model_validate({
    "backend": "ollama",
    "config": {"model_name_or_path": "qwen3:0.6b"},
})
llm = LLMFactory.from_config(cfg)
```
### LLM Core APIs

| Method | Purpose |
|---|---|
| `generate(messages: list)` | Return the full string response |
| `generate_stream(messages)` | Yield streaming chunks |
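The relationship between the two calls can be sketched with a stub that follows the same interface (a toy stand-in, not the MemOS implementation): `generate` returns the complete string, while `generate_stream` yields the same content chunk by chunk.

```python
# Illustrative stub mirroring the documented interface:
# generate() returns the full reply; generate_stream() yields chunks.
class StubLLM:
    def generate(self, messages: list) -> str:
        return "".join(self.generate_stream(messages))

    def generate_stream(self, messages: list):
        # A real backend would stream tokens from the model here.
        for chunk in ["Hello", ", ", "world"]:
            yield chunk

stub = StubLLM()
full = stub.generate([{"role": "user", "content": "hi"}])
# full == "Hello, world"
```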
### Streaming & CoT

```python
messages = [{"role": "user", "content": "Let's think step by step: …"}]
for chunk in llm.generate_stream(messages):
    print(chunk, end="")
```
Find all scenarios in `examples/basic_modules/llm.py`.

### Performance Tips

- Use `qwen3:0.6b` for a <2 GB footprint when prototyping locally.
- Combine with KV Cache (see the KVCacheMemory doc) to cut TTFT (time to first token).
## Embedding Module

### Supported Embedder Backends

| Backend | Example Model | Vector Dim |
|---|---|---|
| `ollama` | `nomic-embed-text:latest` | 768 |
| `sentence_transformer` | `nomic-ai/nomic-embed-text-v1.5` | 768 |
| `universal_api` | `text-embedding-3-large` | 3072 |
### Embedder Config Schema

Shared keys: `model_name_or_path`, plus optional API credentials (`api_key`, `base_url`) for API backends.
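For instance, a `universal_api` blob layers the API credentials on top of the shared keys (the `api_key` value below is a placeholder):

```python
# universal_api embedder config blob; "sk-..." is a placeholder key.
api_embedder_cfg = {
    "backend": "universal_api",
    "config": {
        "model_name_or_path": "text-embedding-3-large",
        "api_key": "sk-...",
        "base_url": "https://api.openai.com/v1",
    },
}
```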
### Factory Usage

```python
from memos.configs.embedder import EmbedderConfigFactory
from memos.embedders.factory import EmbedderFactory

cfg = EmbedderConfigFactory.model_validate({
    "backend": "ollama",
    "config": {"model_name_or_path": "nomic-embed-text:latest"},
})
embedder = EmbedderFactory.from_config(cfg)
```
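Once built, the embedder maps text to fixed-size vectors (768-dim for the Ollama backend above). A common downstream step is comparing vectors by cosine similarity; the sketch below uses tiny hand-written vectors in place of real embedder output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-ins for embedder output (real vectors would be 768-dim).
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 1.0]
v3 = [0.0, 1.0, 0.0]

sim_same = cosine_similarity(v1, v2)   # identical vectors -> 1.0
sim_diff = cosine_similarity(v1, v3)   # orthogonal vectors -> 0.0
```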
## MemScheduler

MemScheduler is a concurrent memory-management system that runs in parallel with MemOS, coordinating memory operations across working memory, long-term memory, and activation memory. It handles memory retrieval, updates, and compaction through event-driven scheduling, making it particularly well suited to conversational agents and reasoning systems that require dynamic memory management.
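The event-driven idea can be sketched with a plain queue of memory events and per-type handlers. This is an illustrative toy, not the MemScheduler API (which runs concurrently and manages real memory tiers):

```python
from collections import deque

# Toy event-driven scheduler sketch (not the MemScheduler API).
class TinyScheduler:
    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def on(self, event_type, handler):
        # Register a handler for one event type, e.g. "retrieve" or "update".
        self.handlers[event_type] = handler

    def submit(self, event_type, payload):
        # Producers enqueue events; handling is deferred.
        self.queue.append((event_type, payload))

    def run(self):
        # Drain the queue, dispatching each event to its handler.
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.handlers[event_type](payload)

working_memory = []
sched = TinyScheduler()
sched.on("retrieve", lambda query: working_memory.append(f"hit:{query}"))
sched.submit("retrieve", "user preferences")
sched.run()
```

Decoupling submission from handling is what lets a real scheduler batch, reorder, or compact memory operations without blocking the conversation loop.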
## KV Cache Memory

KVCacheMemory is a specialized MemOS memory module for storing and managing key-value (KV) caches, primarily used to accelerate large language model (LLM) inference and to support efficient context reuse. It is especially useful as activation memory in conversational and generative AI systems.
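The benefit of context reuse can be illustrated with a toy prefix cache: when two requests share a prompt prefix, the expensive work for that prefix is done once and reused. This is a pure-Python sketch, not KVCacheMemory itself; a real KV cache stores per-layer attention key/value tensors.

```python
# Toy prefix cache illustrating KV-cache-style context reuse.
# A real KV cache holds per-layer attention key/value tensors.
class ToyPrefixCache:
    def __init__(self):
        self.cache = {}
        self.computed = 0  # counts "expensive" prefix computations

    def encode_prefix(self, prefix: str):
        if prefix not in self.cache:
            self.computed += 1                   # expensive step, done once
            self.cache[prefix] = prefix.upper()  # stand-in for cached state
        return self.cache[prefix]

cache = ToyPrefixCache()
cache.encode_prefix("You are a helpful assistant.")
cache.encode_prefix("You are a helpful assistant.")  # cache hit, no recompute
# cache.computed == 1
```

Skipping the recomputation of a shared prefix is exactly what cuts TTFT in the performance tip above.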