KVCacheMemory: Key-Value Cache for Activation Memory
KVCacheMemory is a specialized memory module in MemOS for storing and managing key-value (KV) caches, primarily used to accelerate large language model (LLM) inference and support efficient context reuse. It is especially useful as activation memory in conversational and generative AI systems.
KV-cache Memory Use Cases
In MemOS, KV-cache memory is best suited for storing semantically stable and frequently reused background content such as:
- Frequently asked questions (FAQs) or domain-specific knowledge
- Prior conversation history
These stable plaintext memory items are automatically identified and managed by the MemScheduler module. Once selected, they are converted ahead of time into KV-format representations (KVCacheItem). This precomputation step stores the memory's activation states (Key/Value tensors) in a reusable format, allowing them to be injected into the model's attention cache during inference.
Once converted, these KV memories can be reused across queries without re-encoding the original content. This removes the cost of repeatedly processing large amounts of background text, making the approach well suited to applications that require fast response times and high throughput.
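Conceptually, this precomputation is a single prefill pass over the background text. The sketch below illustrates the idea with a plain Hugging Face causal LM; the model name and variable names are illustrative and not MemOS internals (MemOS performs this step inside KVCacheMemory.extract()).

```python
# Minimal sketch of KV precomputation with a Hugging Face causal LM.
# Shown only to illustrate what "converting memory to KV format" means.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B"  # illustrative; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

background = "FAQ: Our service is available 24/7. Refunds are processed within 5 days."
bg_ids = tok(background, return_tensors="pt").input_ids

with torch.no_grad():
    # One forward pass over the background produces its Key/Value activations.
    kv_cache = model(bg_ids, use_cache=True).past_key_values
```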
Why KV-cache Memory
Integrating MemScheduler with KV-cache memory enables significant performance optimization, particularly in the prefill phase of LLM inference.
Without KVCacheMemory
- Each new query is appended to the full prompt, including the background memory.
- The model must recompute token embeddings and attention over the full sequence — even for unchanged memory.
With KVCacheMemory
- The background content is cached once as Key/Value tensors.
- For each query, only the new user input (query tokens) is encoded.
- The previously cached KV is injected directly into the attention mechanism.
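The difference can be sketched with the model from the precompute example above, assuming a recent transformers version where generate() accepts a pre-filled cache via past_key_values; this is an illustration of the mechanism, not the MemOS injection path itself.

```python
# Sketch: reuse the precomputed background cache for a new query.
# `model`, `tok`, `bg_ids`, and `kv_cache` come from the precompute sketch above.
import copy
import torch

query = "Question: How long do refunds take?"
query_ids = tok(query, return_tensors="pt").input_ids

# Prompt-based injection would re-encode background + query from scratch:
#   model.generate(torch.cat([bg_ids, query_ids], dim=-1))
# KV-cache injection passes the cached background, so only the query tokens are
# encoded during prefill; the full ids are still given so positions line up.
out = model.generate(
    torch.cat([bg_ids, query_ids], dim=-1),
    past_key_values=copy.deepcopy(kv_cache),  # deepcopy keeps the stored cache reusable
    max_new_tokens=32,
)
```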
Benefits
This separation reduces redundant computation in the prefill phase and leads to:
- Skipping repeated encoding of background content
- Faster attention computation between query tokens and cached memory
- Lower Time To First Token (TTFT) latency during generation
This optimization is especially valuable in:
- Multi-turn chatbot interactions
- Retrieval-augmented or context-augmented generation (RAG, CAG)
- Assistants operating over fixed documentation or FAQ-style memory
KVCacheMemory Acceleration Evaluation
To validate the performance impact of KV-based memory injection, we conducted a set of controlled experiments simulating real memory reuse in MemOS.
Experiment Setup
During typical usage, the MemScheduler module continuously tracks interaction patterns and promotes high-frequency, stable plaintext memory into KV format. These KV memories are loaded into GPU memory as activation caches and reused during inference.
The evaluation compares two memory injection strategies:
- Prompt-based injection: background memory is prepended as raw text.
- KV-cache injection: memory is injected directly into the model’s attention cache.
We test these strategies across:
- Three context sizes: short, medium, and long
- Three query types: short-form, medium-form, and long-form
The primary metric is Time To First Token (TTFT), a key latency indicator for responsive generation.
Results
The following table shows results across three models (Qwen3-8B, Qwen3-32B, Qwen2.5-72B). TTFT under KV-cache injection is consistently lower than prompt-based injection, while the output tokens remain identical across both strategies.
Build (s) is the one-time preprocessing cost of converting the memory to KV format, amortized across multiple queries. KV TTFT (s) is the time to first token with KV-cache injection, and Dir TTFT (s) is the time to first token with direct prompt-based injection; CtxTok and QryTok are the context and query lengths in tokens.

Model | Ctx | CtxTok | Qry | QryTok | Build (s) | KV TTFT (s) | Dir TTFT (s) | Speedup (%) |
---|---|---|---|---|---|---|---|---|
Qwen3-8B | long | 6064 | long | 952.7 | 0.92 | 0.50 | 2.37 | 79.1 |
Qwen3-8B | long | 6064 | medium | 302.7 | 0.93 | 0.19 | 2.16 | 91.1 |
Qwen3-8B | long | 6064 | short | 167 | 0.93 | 0.12 | 2.04 | 94.2 |
Qwen3-8B | medium | 2773 | long | 952.7 | 0.41 | 0.43 | 1.22 | 64.6 |
Qwen3-8B | medium | 2773 | medium | 302.7 | 0.41 | 0.16 | 1.08 | 85.1 |
Qwen3-8B | medium | 2773 | short | 167 | 0.43 | 0.10 | 0.95 | 89.7 |
Qwen3-8B | short | 583 | long | 952.7 | 0.12 | 0.39 | 0.51 | 23.0 |
Qwen3-8B | short | 583 | medium | 302.7 | 0.12 | 0.14 | 0.32 | 55.6 |
Qwen3-8B | short | 583 | short | 167 | 0.12 | 0.08 | 0.29 | 71.3 |
Qwen3-32B | long | 6064 | long | 952.7 | 0.71 | 0.31 | 1.09 | 71.4 |
Qwen3-32B | long | 6064 | medium | 302.7 | 0.71 | 0.15 | 0.98 | 84.3 |
Qwen3-32B | long | 6064 | short | 167 | 0.71 | 0.11 | 0.96 | 88.8 |
Qwen3-32B | medium | 2773 | long | 952.7 | 0.31 | 0.24 | 0.56 | 56.9 |
Qwen3-32B | medium | 2773 | medium | 302.7 | 0.31 | 0.12 | 0.47 | 75.1 |
Qwen3-32B | medium | 2773 | short | 167 | 0.31 | 0.08 | 0.44 | 81.2 |
Qwen3-32B | short | 583 | long | 952.7 | 0.09 | 0.20 | 0.24 | 18.6 |
Qwen3-32B | short | 583 | medium | 302.7 | 0.09 | 0.09 | 0.15 | 39.6 |
Qwen3-32B | short | 583 | short | 167 | 0.09 | 0.07 | 0.14 | 53.5 |
Qwen2.5-72B | long | 6064 | long | 952.7 | 1.26 | 0.48 | 2.04 | 76.4 |
Qwen2.5-72B | long | 6064 | medium | 302.7 | 1.26 | 0.23 | 1.82 | 87.2 |
Qwen2.5-72B | long | 6064 | short | 167 | 1.27 | 0.15 | 1.79 | 91.4 |
Qwen2.5-72B | medium | 2773 | long | 952.7 | 0.58 | 0.39 | 1.05 | 62.7 |
Qwen2.5-72B | medium | 2773 | medium | 302.7 | 0.58 | 0.18 | 0.89 | 79.2 |
Qwen2.5-72B | medium | 2773 | short | 167 | 0.71 | 0.23 | 0.82 | 71.6 |
Qwen2.5-72B | short | 583 | long | 952.7 | 0.16 | 0.33 | 0.43 | 23.8 |
Qwen2.5-72B | short | 583 | medium | 302.7 | 0.16 | 0.15 | 0.27 | 43.2 |
Qwen2.5-72B | short | 583 | short | 167 | 0.16 | 0.10 | 0.25 | 60.5 |
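For reference, the Speedup column appears consistent with the relative TTFT reduction; small differences come from rounding in the reported TTFT values.

```python
# Speedup (%) as relative TTFT reduction of KV-cache injection vs. prompt-based injection.
def speedup(dir_ttft: float, kv_ttft: float) -> float:
    return (dir_ttft - kv_ttft) / dir_ttft * 100

# Qwen3-8B, long context, long query: (2.37 - 0.50) / 2.37 * 100 ~ 78.9 (reported: 79.1)
print(round(speedup(2.37, 0.50), 1))
```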
KV-based memory reuse via KVCacheMemory offers substantial latency reduction across model sizes and query types while producing identical output. By shifting reusable memory from plaintext prompts into precomputed KV caches, MemOS eliminates redundant context encoding and achieves faster response times, which is especially beneficial in real-time, memory-augmented LLM applications.
KV-cache Memory Structure
Each cache is stored as a KVCacheItem:
Field | Type | Description |
---|---|---|
kv_cache_id | str | Unique ID for the cache (UUID) |
kv_cache | DynamicCache | The actual key-value cache (transformers) |
metadata | dict | Metadata (source, extraction time, etc.) |
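A quick way to see these fields in practice is to inspect an item produced by extract(); the snippet below assumes the configured mem instance from the "How to Use" example further down, and the attribute names are taken from the table above.

```python
# Inspect a KVCacheItem produced by KVCacheMemory.extract()
# (`mem` is the configured KVCacheMemory instance from the usage example below).
item = mem.extract("Stable background text to cache.")
print(item.kv_cache_id)     # UUID string identifying the cache
print(type(item.kv_cache))  # transformers DynamicCache holding the K/V tensors
print(item.metadata)        # dict with source, extraction time, etc.
```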
API Summary (KVCacheMemory)
Initialization
KVCacheMemory(config: KVCacheMemoryConfig)
Core Methods
Method | Description |
---|---|
extract(text) | Extracts a KV cache from input text using the LLM |
add(memories) | Adds one or more KVCacheItem to memory |
get(memory_id) | Fetch a single cache by ID |
get_by_ids(ids) | Fetch multiple caches by IDs |
get_all() | Returns all stored caches |
get_cache(cache_ids) | Merge and return a combined cache from multiple IDs |
delete(ids) | Delete caches by IDs |
delete_all() | Delete all caches |
dump(dir) | Serialize all caches to a pickle file in directory |
load(dir) | Load caches from a pickle file in directory |
from_textual_memory(mem) | Convert a TextualMemoryItem to a KVCacheItem |
When calling dump(dir), the system writes to:
<dir>/<config.memory_filename>
This file contains a pickled dictionary of all KV caches, which can be reloaded using load(dir).
How to Use
```python
from memos.configs.memory import KVCacheMemoryConfig
from memos.memories.activation.kv import KVCacheMemory

config = KVCacheMemoryConfig(
    extractor_llm={
        "backend": "huggingface",
        "config": {"model_name_or_path": "Qwen/Qwen3-1.7B"},
    }
)
mem = KVCacheMemory(config)

# Extract and add a cache
cache_item = mem.extract("The capital of France is Paris.")
mem.add([cache_item])

# Retrieve and merge caches
merged_cache = mem.get_cache([cache_item.kv_cache_id])

# Save/load
mem.dump("tmp/act_mem")
mem.load("tmp/act_mem")
```
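Since the underlying kv_cache is a transformers DynamicCache, the merged cache can in principle be handed to a Hugging Face model as past_key_values outside of MemOS's own inference path. The sketch below is hedged: tok and model are your own tokenizer/model handles (the same model that produced the cache), not part of the KVCacheMemory API.

```python
# Sketch (not a MemOS API): feed the merged cache to a Hugging Face model yourself.
# Assumes `tok` and `model` match the model that produced the cache, and that the
# cached text tokenizes to the same prefix it had at extraction time.
import copy

full_prompt = "The capital of France is Paris." + " What is the capital of France?"
input_ids = tok(full_prompt, return_tensors="pt").input_ids

output_ids = model.generate(
    input_ids,
    past_key_values=copy.deepcopy(merged_cache),  # deepcopy keeps the stored cache intact
    max_new_tokens=32,
)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```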
Developer Notes
- Uses HuggingFace DynamicCache for efficient key-value storage
- Pickle-based serialization for fast load/save
- All methods are covered by integration tests in /tests
MemScheduler
MemScheduler is a concurrent memory management system that runs in parallel with the MemOS core, coordinating memory operations between working memory, long-term memory, and activation memory. It handles memory retrieval, updates, and compaction through event-driven scheduling, and is particularly suited to conversational agents and reasoning systems that require dynamic memory management.
General Textual Memory
GeneralTextMemory is a flexible, vector-based textual memory module in MemOS, designed for storing, searching, and managing unstructured knowledge. It is suitable for conversational agents, personal assistants, and any system requiring semantic memory retrieval.