DeepSeek V4: Open-Source 1M Context LLM with Advanced Compression
DeepSeek V4 introduces open-source LLMs with a 1 million token context window, offering cost-effective and high-performance AI. It utilizes hybrid attention and KV cache compression to rival top closed-source models.
Introduction
DeepSeek V4 offers open-source, cost-effective large language models (LLMs) with a 1 million token context length, enabling developers to process extensive documentation and build advanced agentic applications. It achieves performance comparable to top closed-source models through innovative compression and attention mechanisms.
Configuration Checklist
| Element | Version / Link |
|---|---|
| Language / Runtime | JavaScript (for coding examples), Python (for Ollama) |
| Main library | DeepSeek V4 (Pro and Flash versions) |
| Required APIs | DeepSeek API (platform.deepseek.com; chat UI at chat.deepseek.com) |
| Keys / credentials needed | API keys for DeepSeek API; GPU cloud access (e.g., Lambda GPU Cloud) for self-hosting |
Step-by-Step Guide
Step 1 — Understanding the Conversational Flow with Tools

DeepSeek V4 models are designed to interact with users in a multi-turn conversation, leveraging external tools to enhance their capabilities. This iterative process allows the model to refine its understanding and generate more accurate responses.
```mermaid
graph TD
A[Input: User message 1] --> B{Thinking 1}
B --> C[Output: Answer 1]
C --> D[Input: User message 1, Answer 1, User message 2]
D --> E{Thinking 2}
E --> F[Output: Answer 2]
F --> G[Input: User message 1, Answer 1, User message 2, Answer 2, User message 3]
G --> H{Thinking 3}
H --> I[Output: Answer 3]
```
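Because each answer is appended to the next turn's input, a client simply keeps accumulating messages. Below is a minimal Python sketch of that loop, assuming DeepSeek V4 keeps the OpenAI-compatible chat API that DeepSeek has offered so far; the base URL and the `deepseek-v4-pro` model name are assumptions to verify against the official API documentation.

```python
# Minimal multi-turn loop against an OpenAI-compatible DeepSeek endpoint.
# Assumptions: the base_url and model name are placeholders -- verify both
# against the official DeepSeek API documentation.
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

messages = []  # the growing conversation shown in the diagram above
for user_msg in ["Summarize the attached design doc.", "Now list its open risks."]:
    messages.append({"role": "user", "content": user_msg})
    reply = client.chat.completions.create(model="deepseek-v4-pro", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # Answer N feeds Input N+1
    print(answer)
```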
Step 2 — Running DeepSeek via Ollama (Conceptual)
For local inference or experimentation, DeepSeek models can be run using Ollama, a tool for running large language models locally. This allows developers to leverage the model's capabilities without relying on external API services.
```bash
# Note: 'deepseek-r1:671b' is the tag shown in the source video; DeepSeek V4
# models are typically published as 'deepseek-v4-pro' or 'deepseek-v4-flash'
# on platforms like Hugging Face. Verify the exact tag in the Ollama library.
ollama run deepseek-r1:671b
```
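The same local models can also be driven programmatically through the official `ollama` Python package (`pip install ollama`). A minimal sketch, assuming a hypothetical `deepseek-v4-flash` tag exists in the Ollama library:

```python
# Local chat via the `ollama` Python package. The model tag below is a
# hypothetical placeholder -- confirm the real tag in the Ollama library.
import ollama

response = ollama.chat(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain KV cache compression in one paragraph."}],
)
print(response["message"]["content"])
```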
Step 3 — Hybrid Attention Architecture for Long Context

DeepSeek V4 employs a hybrid attention architecture that combines two key mechanisms to efficiently handle a 1 million token context length:
- Compressed Sparse Attention (CSA): This technique allows the model to selectively focus on the most relevant tokens within the vast context, similar to how a human might use an index to find specific information in a book. Instead of processing every single token, it identifies and attends to the most important ones.
- Heavily Compressed Attention (HCA): HCA further compresses the key-value (KV) cache, which stores intermediate representations of tokens. This is akin to summarizing each paragraph of a book into a single sentence, allowing the model to retain a high-level understanding of the entire document without consuming excessive memory.
These mechanisms work together with a "Lightning Indexer" that quickly retrieves and processes information from the compressed context, significantly reducing computational overhead; a toy sketch of this select-then-attend pattern follows the diagram below.
```mermaid
graph TD
A[Queries] --> B(Multi-Query Attention)
C[Hidden State of Query Token] --> B
D[Compressed Indexer Keys] --> B
B --> E[Index Scores]
E --> F(Top-k Selector)
G[Compressed KV Entries] --> F
H[Token-Level Compressor] --> G
I[Hidden States of KV Tokens] --> H
F --> J[Selected Compressed KV Entries]
J --> K(Concatenation)
L[Sliding Window KV Entries] --> K
K --> M[Shared Key-Value Multi-Query Attention]
C --> M
C --> D
H --> D
```
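To make the flow concrete, here is a toy numpy sketch of the pattern the diagram describes: a cheap indexer scores compressed keys, a top-k selector keeps the highest-scoring entries, and full attention runs only over that subset. All shapes, projections, and variable names are illustrative assumptions, not DeepSeek's actual implementation.

```python
# Toy select-then-attend sketch: index scores in a compressed space pick the
# top-k KV entries, and attention runs over just those k tokens instead of
# the full context. Purely illustrative; not DeepSeek's real kernels.
import numpy as np

rng = np.random.default_rng(0)
d, d_c, T, k = 64, 16, 100_000, 256      # head dim, compressed dim, context, top-k

q = rng.standard_normal(d)               # hidden state of the query token
kv_hidden = rng.standard_normal((T, d))  # hidden states of the KV tokens

# Token-level compressor: project each KV token into a small indexing space.
W_idx = rng.standard_normal((d, d_c)) / np.sqrt(d)
indexer_keys = kv_hidden @ W_idx         # compressed indexer keys, shape (T, d_c)

# Lightning-indexer step: cheap relevance scores in the compressed space.
index_scores = indexer_keys @ (q @ W_idx)    # shape (T,)

# Top-k selector: keep only the most relevant entries.
top = np.argpartition(index_scores, -k)[-k:]

# Full attention over the k selected entries only (the real model also
# concatenates a sliding window of recent tokens, omitted here).
keys = values = kv_hidden[top]               # shape (k, d)
weights = np.exp(q @ keys.T / np.sqrt(d))
weights /= weights.sum()
output = weights @ values
print(output.shape)                          # (64,)
```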
Comparison Tables

DeepSeek V4 Model Comparison
| Model | Total Params | Active Params | Pre-trained Tokens | Context Length | Open Source | API Service | WEB/APP Mode |
|---|---|---|---|---|---|---|---|
| deepseek-v4-pro | 1.6T | 49B | 33T | 1M | ✓ | ✓ | Expert |
| deepseek-v4-flash | 284B | 13B | 32T | 1M | ✓ | ✓ | Instant |
Accuracy/Pass@1 Benchmarks
| Benchmark (metric) | DeepSeek-V4-Pro-Max | Claude-Opus-4.6-Max | GPT-5.4-xHigh | Gemini-3.1-Pro-High |
|---|---|---|---|---|
| SimpleQA Verified (Pass@1) | 57.9 | 46.2 | 45.3 | 75.6 |
| HLE (Pass@1) | 37.7 | 40.0 | 39.9 | 44.4 |
| Apex Shortlist (Pass@1) | 90.2 | 85.9 | 78.1 | - |
| Codeforces (Rating) | 3206 | 3208 | 3168 | 3052 |
| SWE Verified (Resolved) | 80.6 | 80.6 | 80.6 | - |
| Terminal Bench 2.0 (Acc) | 67.9 | 65.4 | 68.5 | 75.1 |
| Toolathlon (Pass@1) | 51.8 | 47.2 | 48.8 | 54.6 |
Efficiency Comparison (Single-Token FLOPs and KV Cache)
| Model | Single-Token FLOPs (relative to V3.2) | Accumulated KV Cache (relative to V3.2) |
|---|---|---|
| DeepSeek-V4-Pro | 3.7x lower | 9.5x smaller |
| DeepSeek-V4-Flash | 9.8x lower | 13.7x smaller |
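To get a feel for what these ratios mean at the full 1M-token window, the back-of-envelope sketch below converts them into cache sizes. The baseline per-token KV cost is a made-up illustrative figure, not a published DeepSeek number; only the reduction factors come from the table.

```python
# Illustrative only: BASELINE_KV_BYTES_PER_TOKEN is an assumed V3.2-style
# cost, not a published figure.
BASELINE_KV_BYTES_PER_TOKEN = 70 * 1024
CONTEXT_TOKENS = 1_000_000

for name, factor in [("DeepSeek-V4-Pro", 9.5), ("DeepSeek-V4-Flash", 13.7)]:
    gib = BASELINE_KV_BYTES_PER_TOKEN * CONTEXT_TOKENS / factor / 2**30
    print(f"{name}: ~{gib:.1f} GiB of KV cache at 1M tokens")
```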
World Knowledge Benchmarks (Accuracy)
| Benchmark (Metric) | # Shots | DeepSeek-V3.2 Base (Previous) | DeepSeek-V4-Flash Base (New) | DeepSeek-V4-Pro Base (New) |
|---|---|---|---|---|
| AGIEval (EM) | 0-shot | 80.1 | 82.6 | 83.1 |
| MMLU (EM) | 5-shot | 87.8 | 88.7 | 90.1 |
| MMLU-Redux (EM) | 5-shot | 87.9 | 88.9 | 90.8 |
| MMLU-Pro (EM) | 5-shot | 65.5 | 68.3 | 73.5 |
| MMMU (EM) | 5-shot | 87.9 | 88.9 | 90.3 |
| C-Eval (EM) | 5-shot | 90.4 | 92.1 | 93.1 |
| CMMLU (EM) | 5-shot | 88.9 | 90.4 | 90.8 |
| MultiLoKo (EM) | 5-shot | 38.7 | 42.2 | 51.1 |
| SimpleQA Verified (EM) | 25-shot | 28.3 | 30.1 | 55.2 |
| SuperGPQA (EM) | 25-shot | 45.0 | 46.5 | 53.9 |
| FACTS Parametric (EM) | 25-shot | 27.1 | 33.9 | 62.6 |
| TriviaQA (EM) | 5-shot | 83.3 | 85.0 | 85.6 |
⚠️ Common Mistakes & Pitfalls
- Overestimating Multimodality: DeepSeek V4 is currently a unimodal (text-only) system. It does not process images or audio directly. Fix: Understand its current scope and avoid expecting multimodal capabilities. Future versions may introduce these features.
- Misinterpreting KV Cache Compression: While KV cache compression significantly reduces memory needs for inference (up to ~90% smaller), the full model weights still need to be loaded into memory. Fix: Ensure your hardware has sufficient GPU memory to load the base model, even if the inference process is more efficient (see the rough check after this list).
- Context Window Degradation: Like many LLMs, DeepSeek V4's performance can degrade when operating at the extreme limits of its 1 million token context window. Fix: Be mindful of the context length and test performance at scale. While it supports 1M tokens, accuracy might slightly decrease at the very end of the window.
- Lack of Full Theoretical Understanding: The paper transparently states that some underlying mechanisms, particularly those contributing to training stability, are not yet fully understood by the creators. Fix: Approach with an experimental mindset and contribute to community exploration, recognizing that AI research is an evolving field.
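As a rough check on the KV-cache pitfall above, the parameter counts from the model comparison table translate into weight memory as follows; the serving precisions are illustrative assumptions.

```python
# Weight memory alone, before any KV cache: parameter counts come from the
# model table above; bytes-per-parameter depends on the serving precision.
PARAMS = {"deepseek-v4-pro": 1.6e12, "deepseek-v4-flash": 284e9}

for model, n_params in PARAMS.items():
    for precision, bytes_per_param in [("fp8", 1), ("bf16", 2)]:
        gb = n_params * bytes_per_param / 1e9
        print(f"{model} @ {precision}: ~{gb:.0f} GB of weights")
```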
Glossary
Context Length: The maximum number of tokens an AI model can process and consider at once to generate a response.
KV Cache: A memory mechanism in transformer models that stores previously computed key and value vectors for attention layers, speeding up subsequent token generation.
Mixture of Experts (MoE): An architecture where different "expert" neural networks specialize in different types of data or tasks, and a "gate" network learns to select which experts to use for a given input.
Token-Level Compressor: A component that compresses individual tokens or small groups of tokens within the KV cache to reduce memory footprint and improve efficiency.
Heavily Compressed Attention (HCA): An attention mechanism that applies significant compression to key-value pairs to improve long-context efficiency.
Compressed Sparse Attention (CSA): An attention mechanism that selectively focuses on a subset of relevant tokens rather than all tokens in the context, combined with compression for efficiency.
Key Takeaways
- DeepSeek V4 offers open-source LLMs with an impressive 1 million token context window, making long-form content processing highly accessible.
- The release includes two models: DeepSeek-V4-Pro (1.6T total params, 49B active) and DeepSeek-V4-Flash (284B total params, 13B active).
- DeepSeek-V4-Pro demonstrates performance comparable to or exceeding top closed-source models like Claude Opus and Gemini 3.1 Pro in various benchmarks.
- Significant efficiency improvements are achieved through a hybrid attention architecture, resulting in 3.7x to 9.8x lower FLOPs and a 9.5x to 13.7x smaller KV cache compared to DeepSeek-V3.2.
- The architecture incorporates novel techniques like Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and a Lightning Indexer for efficient long-context handling.
- DeepSeek V4 is highly capable at code generation, allowing users to create and run complex JavaScript programs directly within the DeepSeek environment.
- The API pricing is remarkably cost-effective, offering rates potentially 8 to 30 times cheaper than competitors like Anthropic's Claude.
- While powerful, DeepSeek V4 is currently unimodal (text-only) and its creators acknowledge that some underlying mechanisms are still subjects of ongoing research.
Resources
- DeepSeek AI Chat: chat.deepseek.com
- DeepSeek-V4-Pro Tech Report: huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
- DeepSeek-V4 Open Weights: huggingface.co/collections/deepseek-ai/deepseek-v4
- Lambda GPU Cloud: lambda.ai/papers
- Ollama Model Library: ollama.com/library (verify the exact DeepSeek model tags and usage instructions)