Instructions to use CMB-AI-LAB/lagkv_cache with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use CMB-AI-LAB/lagkv_cache with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("CMB-AI-LAB/lagkv_cache", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| library_name: transformers | |
| tags: | |
| - custom_generate | |
| # LagKV Cache | |
| ## Introduction | |
|  | |
| LagKV is an efficient and robust KV compression algorithm. It uses lag tokens information to compress the previous ones which significantly boost the compression performance with little computation overhead. | |
| [Original Github](https://github.com/AI-Lab-China-Merchants-Bank/LagKV) | |
| Details are in the following work: | |
| [LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important](https://arxiv.org/abs/2504.04704) | |
| ## Example usage | |
| We can use the custom generation method in this repository like the the base `generate` from `transformers`: | |
| ```py | |
| # requires `transformers>=4.52.0` | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| # Preparing model, tokenizer, and model inputs | |
| tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B") | |
| model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", device_map="auto") | |
| messages = [{"role": "user", "content": "Tell me a story about a cat."}] | |
| text = tokenizer.apply_chat_template( | |
| messages, | |
| tokenize=False, | |
| add_generation_prompt=True, | |
| enable_thinking=False | |
| ) | |
| model_inputs = tokenizer([text], return_tensors="pt").to(model.device) | |
| # Using lagkv cache | |
| gen_out = model.generate( | |
| # usual `generate` arguments | |
| **model_inputs, | |
| do_sample=False, | |
| max_new_tokens=100, | |
| return_dict_in_generate=True, | |
| # lagkv cache arguments (default `lag_ratio=0.5,lag_size=128,lag_sink_size=16`) | |
| custom_generate="CMB-AI-LAB/lagkv_cache", | |
| trust_remote_code=True, | |
| ) | |
| print(tokenizer.batch_decode(gen_out.sequences, skip_special_tokens=True)) | |
| assert "lagkvcache" in str(type(gen_out.past_key_values)).lower() |