Two Minute Papers
Tags: DeepSeek V4, LLM, Open Source AI

DeepSeek V4: Open-Source 1M Context LLM with Advanced Compression

DeepSeek V4 introduces open-source LLMs with a 1 million token context window, offering cost-effective and high-performance AI. It utilizes hybrid attention and KV cache compression to rival top closed-source models.

5 min read · AI Guide

Introduction

DeepSeek V4 offers open-source, cost-effective large language models (LLMs) with a 1 million token context length, enabling developers to process extensive documentation and build advanced agentic applications. It achieves performance comparable to top closed-source models through innovative compression and attention mechanisms.

Configuration Checklist

| Element | Version / Link |
| --- | --- |
| Language / Runtime | JavaScript (for coding examples), Python (for Ollama) |
| Main library | DeepSeek V4 (Pro and Flash versions) |
| Required APIs | DeepSeek API (available at chat.deepseek.com) |
| Keys / credentials needed | API keys for the DeepSeek API; GPU cloud access (e.g., Lambda GPU Cloud) for self-hosting |

Step-by-Step Guide

Step 1 — Understanding the Conversational Flow with Tools

DeepSeek V4 models are designed to interact with users in a multi-turn conversation, leveraging external tools to enhance their capabilities. This iterative process lets the model refine its understanding and generate more accurate responses; the diagram below traces the loop, and the code sketch that follows shows the same pattern in practice.

```mermaid
graph TD
    A[Input: User message 1] --> B{Thinking 1}
    B --> C[Output: Answer 1]
    C --> D[Input: User message 1, Answer 1, User message 2]
    D --> E{Thinking 2}
    E --> F[Output: Answer 2]
    F --> G[Input: User message 1, Answer 1, User message 2, Answer 2, User message 3]
    G --> H{Thinking 3}
    H --> I[Output: Answer 3]
```
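
The loop above maps directly onto a chat-style API in which each answer is appended to the running message history. Below is a minimal Python sketch of that pattern; the OpenAI-compatible client, base URL, and the model id "deepseek-v4-pro" are assumptions to verify against the official DeepSeek API documentation.

```python
# Multi-turn loop: each answer is fed back into the next turn's input,
# matching the Thinking 1 -> Answer 1 -> Thinking 2 flow diagrammed above.
# Assumption: an OpenAI-compatible endpoint; the model id is a placeholder.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

messages = []  # accumulated context: user messages plus model answers

def ask(user_message: str) -> str:
    messages.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="deepseek-v4-pro",  # hypothetical model id; check the docs
        messages=messages,
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer

print(ask("Summarize the attached design doc."))  # Thinking 1 -> Answer 1
print(ask("Now list the open questions."))        # Thinking 2 sees Answer 1
```

Because the entire history is resent on every turn, long conversations are exactly where the 1 million token context window and KV cache compression pay off.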

Step 2 — Running DeepSeek via Ollama (Conceptual)

For local inference or experimentation, DeepSeek models can be run using Ollama, a tool for running large language models locally. This allows developers to leverage the model's capabilities without relying on external API services.

```bash
ollama run deepseek-r1:671b
# Note: 'deepseek-r1:671b' is the model tag shown in the video. DeepSeek V4
# models may be published under names like 'deepseek-v4-pro' or
# 'deepseek-v4-flash'; verify the exact tag in the official Ollama library.
```
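
Once a model is pulled, Ollama also exposes a local REST endpoint, which is convenient for scripting. A minimal sketch, assuming the server runs on its default port and uses the model tag from the command above:

```python
# Query a locally running Ollama server over its REST API (default port 11434).
# Assumption: the model tag mirrors the command above; it may differ for V4.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:671b",  # replace with the verified V4 tag
        "messages": [{"role": "user", "content": "Explain KV cache compression briefly."}],
        "stream": False,  # return one complete JSON response instead of a stream
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```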

Step 3 — Hybrid Attention Architecture for Long Context

DeepSeek V4 employs a hybrid attention architecture that combines two key mechanisms to efficiently handle a 1 million token context length:

  1. Compressed Sparse Attention (CSA): This technique allows the model to selectively focus on the most relevant tokens within the vast context, similar to how a human might use an index to find specific information in a book. Instead of processing every single token, it identifies and attends to the most important ones.
  2. Heavily Compressed Attention (HCA): HCA further compresses the key-value (KV) cache, which stores intermediate representations of tokens. This is akin to summarizing each paragraph of a book into a single sentence, allowing the model to retain a high-level understanding of the entire document without consuming excessive memory.

These mechanisms work together with a "Lightning Indexer" to quickly retrieve and process information from the compressed context, significantly reducing computational overhead. The diagram below traces the data flow, and a toy code sketch follows it.

```mermaid
graph TD
    A[Queries] --> B(Multi-Query Attention)
    C[Hidden State of Query Token] --> B
    D[Compressed Indexer Keys] --> B
    B --> E[Index Scores]
    E --> F(Top-k Selector)
    G[Compressed KV Entries] --> F
    H[Token-Level Compressor] --> G
    I[Hidden States of KV Tokens] --> H
    F --> J[Selected Compressed KV Entries]
    J --> K(Concatenation)
    L[Sliding Window KV Entries] --> K
    K --> M[Shared Key-Value Multi-Query Attention]
    C --> M
    C --> D
    H --> D
```
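
To make the diagram concrete, here is a toy Python sketch of the selection step it describes: compressed indexer keys are scored cheaply against the query, a top-k selector keeps only the best entries, and attention runs over that small subset. All shapes, projections, and the scoring rule are illustrative stand-ins, not DeepSeek's actual kernels.

```python
# Toy indexer -> top-k -> attention pipeline, mirroring the diagram above.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_index, top_k = 4096, 64, 16, 128

kv_hidden = rng.standard_normal((seq_len, d_model))  # hidden states of KV tokens
query = rng.standard_normal(d_model)                 # hidden state of the query token

# Token-level compressor: project KV states into a small index space.
W_index = rng.standard_normal((d_model, d_index)) / np.sqrt(d_model)
indexer_keys = kv_hidden @ W_index                   # compressed indexer keys

# Lightning-Indexer-style scoring: cheap dot products in the index space.
index_scores = indexer_keys @ (query @ W_index)

# Top-k selector: keep only the highest-scoring compressed entries.
selected = np.argpartition(index_scores, -top_k)[-top_k:]

# Attention touches only the selected tokens (128 of 4096 here).
keys = kv_hidden[selected]
weights = np.exp(keys @ query / np.sqrt(d_model))
weights /= weights.sum()
output = weights @ keys
print(f"attended to {top_k}/{seq_len} tokens; output shape {output.shape}")
```

The saving comes from the selector: full attention would touch all 4,096 tokens for this query, while the sketch touches 128 plus one cheap 16-dimensional scoring pass.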

Comparison Tables

DeepSeek V4 Model Comparison

| Model | Total Params | Active Params | Pre-trained Tokens | Context Length | Open Source | API Service | WEB/APP Mode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| deepseek-v4-pro | 1.6T | 49B | 33T | 1M | ✓ | ✓ | Expert |
| deepseek-v4-flash | 284B | 13B | 32T | 1M | ✓ | ✓ | Instant |

Accuracy/Pass@1 Benchmarks

| Benchmark (metric) | DeepSeek-V4-Pro-Max | Claude-Opus-4.6-Max | GPT-5.4-xHigh | Gemini-3.1-Pro-High |
| --- | --- | --- | --- | --- |
| SimpleQA Verified (Pass@1) | 57.9 | 46.2 | 45.3 | 75.6 |
| HLE (Pass@1) | 37.7 | 40.0 | 39.9 | 44.4 |
| Apex Shortlist (Pass@1) | 90.2 | 85.9 | 78.1 | - |
| Codeforces (Rating) | 3206 | 3208 | 3168 | 3052 |
| SWE Verified (Resolved) | 80.6 | 80.6 | 80.6 | - |
| Terminal Bench 2.0 (Acc) | 67.9 | 65.4 | 68.5 | 75.1 |
| Toolathlon (Pass@1) | 51.8 | 47.2 | 48.8 | 54.6 |

Efficiency Comparison (Single-Token FLOPs and KV Cache)

| Model | Single-Token FLOPs (relative to V3.2) | Accumulated KV Cache (relative to V3.2) |
| --- | --- | --- |
| DeepSeek-V4-Pro | 3.7x lower | 9.5x smaller |
| DeepSeek-V4-Flash | 9.8x lower | 13.7x smaller |

World Knowledge Benchmarks (Accuracy)

| Benchmark (Metric) | # Shots | DeepSeek-V3.2 Base (Previous) | DeepSeek-V4-Flash Base (New) | DeepSeek-V4-Pro Base (New) |
| --- | --- | --- | --- | --- |
| AGIEval (EM) | 0-shot | 80.1 | 82.6 | 83.1 |
| MMLU (EM) | 5-shot | 87.8 | 88.7 | 90.1 |
| MMLU-Redux (EM) | 5-shot | 87.9 | 88.9 | 90.8 |
| MMLU-Pro (EM) | 5-shot | 65.5 | 68.3 | 73.5 |
| MMMU (EM) | 5-shot | 87.9 | 88.9 | 90.3 |
| C-Eval (EM) | 5-shot | 90.4 | 92.1 | 93.1 |
| CMMLU (EM) | 5-shot | 88.9 | 90.4 | 90.8 |
| MultiLoKo (EM) | 5-shot | 38.7 | 42.2 | 51.1 |
| SimpleQA Verified (EM) | 25-shot | 28.3 | 30.1 | 55.2 |
| SuperGPQA (EM) | 25-shot | 45.0 | 46.5 | 53.9 |
| FACTS Parametric (EM) | 25-shot | 27.1 | 33.9 | 62.6 |
| TriviaQA (EM) | 5-shot | 83.3 | 85.0 | 85.6 |

⚠️ Common Mistakes & Pitfalls

  1. Overestimating Multimodality: DeepSeek V4 is currently a unimodal (text-only) system. It does not process images or audio directly. Fix: Understand its current scope and avoid expecting multimodal capabilities. Future versions may introduce these features.
  2. Misinterpreting KV Cache Compression: While KV cache compression significantly reduces memory needs for inference (up to 90% smaller), the full model still needs to be loaded into memory. Fix: Ensure your hardware has sufficient GPU memory to load the base model, even if the inference process is more efficient; a rough estimate follows this list.
  3. Context Window Degradation: Like many LLMs, DeepSeek V4's performance can degrade when operating at the extreme limits of its 1 million token context window. Fix: Be mindful of the context length and test performance at scale. While it supports 1M tokens, accuracy might slightly decrease at the very end of the window.
  4. Lack of Full Theoretical Understanding: The paper transparently states that some underlying mechanisms, particularly those contributing to training stability, are not yet fully understood by the creators. Fix: Approach with an experimental mindset and contribute to community exploration, recognizing that AI research is an evolving field.
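
For pitfall 2, a back-of-the-envelope estimate of the memory needed just to hold the weights, using the parameter counts from the model comparison table. The bytes-per-parameter figures are common quantization assumptions, not official requirements.

```python
# Rough GPU memory required to LOAD the weights (independent of KV cache size).
def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    return total_params * bytes_per_param / 1e9

for name, params in [("deepseek-v4-pro", 1.6e12), ("deepseek-v4-flash", 284e9)]:
    for label, bpp in [("FP16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
        print(f"{name} @ {label}: ~{weight_memory_gb(params, bpp):,.0f} GB")

# Even deepseek-v4-flash at 4-bit needs roughly 142 GB of weights, so KV cache
# compression alone does not make these models fit on a single consumer GPU.
```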

Glossary

Context Length: The maximum number of tokens an AI model can process and consider at once to generate a response.
KV Cache: A memory mechanism in transformer models that stores previously computed key and value vectors for attention layers, speeding up subsequent token generation.
Mixture of Experts (MoE): An architecture where different "expert" neural networks specialize in different types of data or tasks, and a "gate" network learns to select which experts to use for a given input; a minimal routing sketch follows this glossary.
Token-Level Compressor: A component that compresses individual tokens or small groups of tokens within the KV cache to reduce memory footprint and improve efficiency.
Heavily Compressed Attention (HCA): An attention mechanism that applies significant compression to key-value pairs to improve long-context efficiency.
Compressed Sparse Attention (CSA): An attention mechanism that selectively focuses on a subset of relevant tokens rather than all tokens in the context, combined with compression for efficiency.
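
As a companion to the Mixture of Experts entry, here is a minimal routing sketch: a gate scores every expert for an input and only the top-scoring experts run, which is why the models' active parameter counts (49B and 13B) are far below their totals (1.6T and 284B). The sizes and the top-2 choice are illustrative, not DeepSeek's actual router.

```python
# Minimal MoE routing: the gate scores experts, only the top-2 are executed.
import numpy as np

rng = np.random.default_rng(1)
d_model, n_experts, top_k = 32, 8, 2

x = rng.standard_normal(d_model)                    # one token's hidden state
W_gate = rng.standard_normal((d_model, n_experts))  # the "gate" network
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

logits = x @ W_gate
chosen = np.argsort(logits)[-top_k:]                # activate only the top-2 experts
gate = np.exp(logits[chosen])
gate /= gate.sum()                                  # normalize the active weights

# Output is the gate-weighted sum of the active experts; the rest stay idle,
# which is why "active params" is much smaller than "total params".
y = sum(w * (x @ experts[i]) for w, i in zip(gate, chosen))
print(f"used experts {sorted(chosen.tolist())} of {n_experts}; output dim {y.shape}")
```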

Key Takeaways

  • DeepSeek V4 offers open-source LLMs with an impressive 1 million token context window, making long-form content processing highly accessible.
  • The release includes two models: DeepSeek-V4-Pro (1.6T total params, 49B active) and DeepSeek-V4-Flash (284B total params, 13B active).
  • DeepSeek-V4-Pro demonstrates performance comparable to or exceeding top closed-source models like Claude Opus and Gemini 3.1 Pro in various benchmarks.
  • Significant efficiency improvements are achieved through a hybrid attention architecture, resulting in 3.7x to 9.8x lower FLOPs and a 9.5x to 13.7x smaller KV cache compared to DeepSeek-V3.2.
  • The architecture incorporates novel techniques like Compressed Sparse Attention (CSA), Heavily Compressed Attention (HCA), and a Lightning Indexer for efficient long-context handling.
  • DeepSeek V4 is highly capable at code generation, allowing users to create and run complex JavaScript programs directly within the DeepSeek environment.
  • The API pricing is remarkably cost-effective, offering rates potentially 8 to 30 times cheaper than competitors like Anthropic's Claude.
  • While powerful, DeepSeek V4 is currently unimodal (text-only) and its creators acknowledge that some underlying mechanisms are still subjects of ongoing research.

Resources