Two Minute Papers
#DeepSeek AI #multimodal LLM #visual primitives

DeepSeek AI: Thinking with Visual Primitives for Enhanced Multimodal Reasoning

DeepSeek AI introduces 'Thinking with Visual Primitives,' a novel framework that enhances multimodal large language models by integrating spatial markers and bounding boxes directly into the reasoning process. This approach improves accuracy, efficiency, and interpretability for complex visual tasks.

5 min read · AI Guide

Introduction

DeepSeek AI's 'Thinking with Visual Primitives' is a novel reasoning framework that integrates spatial markers and bounding boxes directly into the AI's thought process, bridging the 'Perception Gap' and 'Reference Gap' in multimodal large language models (MLLMs). This allows AI systems to 'point' at specific visual elements while reasoning, leading to more accurate, efficient, and interpretable outcomes for complex visual tasks.

Configuration Checklist

| Element | Version / Link |
| --- | --- |
| Language / Runtime | Not explicitly stated, but typically Python for AI/ML |
| Main library | DeepSeek-V4-Flash (model architecture) |
| Required APIs | Not explicitly stated, but implies internal DeepSeek APIs for visual primitive integration |
| Keys / credentials needed | Not applicable for the conceptual framework; running the model on Lambda GPU Cloud requires user credentials |

Step-by-Step Guide

This section outlines the conceptual steps and mechanisms of 'Thinking with Visual Primitives' as demonstrated by DeepSeek AI, focusing on how the model processes visual information and performs reasoning tasks.

Step 1 — Grounding Language to Visuals

The core idea is to enable the AI to directly reference specific physical coordinates or regions within an image, rather than relying solely on linguistic descriptions. This is achieved by introducing visual primitives like points and bounding boxes into the reasoning process.

// User query: "Count the number of men in this image."
// AI's internal thought process using visual primitives:

// 1. Deconstructing the query:
// The user wants me to count the total number of men in the image. This looks like a team photo. I need to count everyone present: the players in the team kit (striped jerseys) as well as the coaches or officials standing on the sides in regular clothes.

// 2. Sweeping the photo for men:
// Looking at the group, I'll identify all the men across the different rows. Here they are:
// <ref> the men </ref> | <box> [x1,y1,x2,y2] </box> // Example bounding box for a person
// This covers the players in the front sitting down, the middle row seated or crouching, the back row standing, and the four non-players (two on the far left, two on the far right) dressed in suits/sweaters.

// 3. Tallying the group:
// Counting everyone up, there are 4 guys sitting on the ground in the front row, 9 players seated in the middle row, 8 players standing in the back row, plus 2 coaches/officials on the left and 2 on the right. 4 + 9 + 8 + 2 + 2 = 25. The total number is 25.

// Response:
// There are 25 men in this image.
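The trace above becomes machine-checkable once the boxes are pulled out of the text. The following is a minimal sketch, assuming the tag layout shown in the trace (`<ref> label </ref> | <box> [x1,y1,x2,y2] ... </box>`); the exact serialization used by the model is not given in the source, and the coordinates in `sample` are made up for illustration.

```python
import re
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]

# Assumed layout, modelled on the trace above: <ref> label </ref> | <box> ... </box>
REF_BOX = re.compile(r"<ref>\s*(.*?)\s*</ref>\s*\|\s*<box>\s*(.*?)\s*</box>", re.S)
COORDS = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_boxes(trace: str) -> Dict[str, List[Box]]:
    """Map each referenced label to the bounding boxes listed after it."""
    found: Dict[str, List[Box]] = {}
    for label, body in REF_BOX.findall(trace):
        boxes = [tuple(int(v) for v in match) for match in COORDS.findall(body)]
        found.setdefault(label, []).extend(boxes)
    return found


# Hypothetical coordinates, not taken from the paper.
sample = "<ref> the men </ref> | <box> [10,20,60,200] [70,22,120,198] [130,25,180,201] </box>"
boxes = parse_boxes(sample)
print(len(boxes["the men"]))  # one box per person the model pointed at
```

Counting the parsed boxes gives an independent check on the tally the model states in its final answer.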

Step 2 — Topological Reasoning for Maze Navigation

DeepSeek can solve mazes by identifying the start and end points and then systematically exploring paths. The visual primitives allow the AI to 'trace' its path and make decisions based on spatial relationships.

// AI's opening statement (it announces its strategy and locates the endpoints): "I'll use a trial-and-error strategy to explore this maze. First locate the starting point: <point> [(309,512)] </point>, and the destination: <point> [(652,644)] </point>."

// AI's exploration process using visual primitives:
// **Start Exploring**:
// **Step 1**: Reaching <point> [(357,369)] </point> after going upper-right, I face 2 directions to choose from (lower-right, left). Lower-right direction might also be correct, but I'll explore left direction first.
// **Step 2**: Moving left: <point> [(357,369)] <point> [(260,372)] </point> -> reaching <point> [(260,372)] </point>, I see 2 forks here. Observing that upper-right direction is a straight dead end at <point> [(310,226)] </point>, can see wall directly. Still need to explore left direction.
// ... (further steps of exploration with point coordinates)
// **Step 10**: Up ahead <point> [(652,645)] </point> is the destination! Through systematic exploration, I find the correct path.

// **Final Path**: After exploration, the correct route is:
// <point> [(310,513)] <point> [(357,369)] <point> [(408,510)] <point> [(457,650)] <point> [(506,507)] <point> [(603,507)] <point> [(652,645)] </point>

// Response:
// The maze is solvable. The verified path is:
// <boxed>[True]</boxed> <point> [(310,513)] <point> [(357,369)] <point> [(408,510)] <point> [(457,650)] <point> [(506,507)] <point> [(603,507)] <point> [(652,645)] </point>!
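The trial-and-error exploration in the trace (try a direction, hit a dead end, back up, try another) has the same shape as depth-first search with backtracking. The sketch below shows that strategy on a tiny character grid. It is an analogy under stated assumptions, not the model's actual mechanism: the grid encoding (`#` for walls), the function name `solve_maze`, and the example maze are all hypothetical.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # (row, col)


def solve_maze(grid: List[str], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Depth-first search with backtracking; '#' is a wall, anything else is open."""
    rows, cols = len(grid), len(grid[0])
    visited = set()
    path: List[Cell] = []

    def explore(cell: Cell) -> bool:
        r, c = cell
        if not (0 <= r < rows and 0 <= c < cols):
            return False
        if grid[r][c] == "#" or cell in visited:
            return False
        visited.add(cell)
        path.append(cell)
        if cell == goal:
            return True
        # Try up, right, down, left in turn.
        for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):
            if explore((r + dr, c + dc)):
                return True
        path.pop()  # dead end: backtrack and let the caller try another direction
        return False

    return path if explore(start) else None


maze = [
    "S.#",
    "#.#",
    "#.G",
]
route = solve_maze(maze, (0, 0), (2, 2))
print(route)
```

Returning `None` for an unsolvable maze plays the role of the `<boxed>[True]</boxed>` solvability flag in the response above. The recursion is fine for small grids; a large maze would call for an explicit stack.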

Step 3 — Tracing Connections in Complex Visuals

This capability allows the AI to identify and follow complex paths or connections between various elements in an image, providing both the answer and a visual trace of its reasoning.

// User query: "Where does the crown connect?"
// AI's reasoning process:
// 1. Identify the crown: <point> [(198,160)] </point>
// 2. Trace the path connected to the crown.
// 3. Identify the endpoint of the traced path: <point> [(650,190)] </point> (octopus)

// Response:
// The crown connects to the octopus.
// Visual trace: (path highlighted from crown to octopus)
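Tracing a connection amounts to following a drawn line from one icon until it reaches another. A minimal sketch of that idea on a character grid follows, assuming `*` marks line pixels, letters mark icons (`C` for the crown, `O` for the octopus), and `.` is background. The encoding and the function name `trace_connection` are illustrative assumptions, not the model's internal representation.

```python
from collections import deque
from typing import List, Optional, Tuple


def trace_connection(image: List[str], start: Tuple[int, int]) -> Optional[str]:
    """Follow line cells ('*') outward from the start icon; return the icon reached."""
    rows, cols = len(image), len(image[0])
    start_label = image[start[0]][start[1]]
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or (nr, nc) in seen:
                continue
            cell = image[nr][nc]
            if cell == "*":
                # Still on the line: keep following it.
                seen.add((nr, nc))
                queue.append((nr, nc))
            elif cell.isalpha() and cell != start_label:
                # The line touches another icon: that is the endpoint.
                return cell
    return None


picture = [
    "C**.A",
    "..*..",
    "B.**O",
]
print(trace_connection(picture, (0, 0)))
```

The set `seen` doubles as the visual trace: it holds every cell the search walked along between the two icons.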

Step 4 — Running the DeepSeek Model (Example with Ollama)

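DeepSeek models can be run locally or on cloud GPUs. The video demonstrates running DeepSeek R1 via Ollama, a tool for running large language models locally. R1 is a text-only reasoning model, so this step illustrates local inference in general rather than the visual-primitives model itself.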

ollama run deepseek-r1:671b

Once the model is running, you can send messages. The video shows an example of explaining transformers using only emojis:

>>> Use only emoji to explain how a transformer neural network works and its advantage. Be creative!
<think>
Okay, the user wants me to use only emojis to explain how a transformer neural network works and its advantage.
First, transformers process sequences of data, like sentences. So maybe an input symbol like 📝 or some processed sequentially
Need to make sure it's coherent as emojis. Let me try structuring step-by-step.
</think>

**🧠Transformer Explained (Emoji Edition):**

➡️ Input: 📝➡️🔢 (Text → Tokens)
1️⃣2️⃣3️⃣ (Add Positional Encoding)

👀**Self-Attention**:
- Words 🗣️↔️🗣️ Each Other ("Cat loves fish" 🐱❤️🐟)
- Focus: 🎯 (Context Links Everywhere!)

🧠**Layers Stacked**: 🥞🥞🥞 (Deep Processing)
⚡ No Sequence Dependency! (vs. RNNs 🐌)

💡 Output: ✨
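The same exchange can be scripted against Ollama's local HTTP endpoint instead of the interactive prompt. The sketch below assumes a server on the default port and that the reasoning arrives inline between `<think>` tags, as in the transcript above; depending on the Ollama version and settings, the reasoning may be returned in a separate field instead. The helper names `split_thinking` and `ask_ollama` are hypothetical.

```python
import json
import re
import urllib.request
from typing import Tuple

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def split_thinking(text: str) -> Tuple[str, str]:
    """Separate the <think>...</think> block from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, re.S)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def ask_ollama(prompt: str, model: str = "deepseek-r1:671b") -> Tuple[str, str]:
    """Send one non-streaming request to a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        body = json.load(reply)
    return split_thinking(body["response"])


# Offline demonstration of the parsing step only; no server is contacted here.
thinking, answer = split_thinking("<think>\nplan the emojis\n</think>\n\n📝➡️🔢")
print(answer)
```

With the server running and the model pulled, `ask_ollama("Use only emoji to explain how a transformer neural network works.")` would return the reasoning and the answer as two strings.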

Step 5 — Training with On-Policy Distillation

The underlying methodology for training DeepSeek's 'Thinking with Visual Primitives' involves a technique called On-Policy Distillation. This approach allows a single, smaller 'student' model to learn from the diverse expertise of multiple larger 'teacher' models, each specialized in different visual reasoning tasks.

\mathcal{L}_{OPD}(\theta) = \sum_{i=1}^{N} w_i \cdot D_{KL}(\pi_{\theta} \| \pi_{E_i})

Terms in the formula:

  • L_OPD(θ): The On-Policy Distillation loss function for the student model's parameters (θ).
  • N: The total number of expert teacher models.
  • w_i: A weighting factor for each expert model E_i, allowing some experts to contribute more to the student's learning.
  • D_KL(π_θ || π_Ei): The Kullback-Leibler (KL) divergence between the student model's policy (π_θ) and the policy of an expert model (π_Ei). It measures how far the student's output distribution is from the expert's. Because the student appears as the first argument, the divergence is an expectation over the student's own samples, which is what makes the distillation on-policy. Minimizing it pushes the student to mimic the experts.

This process effectively distills the knowledge and reasoning capabilities of multiple specialized models into a single, more generalized and efficient student model, enabling it to perform a wide range of visual thinking tasks.
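To make the loss concrete, here is a toy numerical sketch of the formula for discrete distributions. It assumes each policy is a plain probability vector over the same outcomes and that every expert probability is non-zero; real training would apply this per token on sequences sampled from the student. The names `kl_divergence` and `opd_loss` and all of the numbers are illustrative.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) for two discrete distributions over the same outcomes."""
    # Terms with p_i == 0 contribute nothing; q_i is assumed to be non-zero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def opd_loss(
    student: Sequence[float],
    experts: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Weighted sum of D_KL(student || expert_i), mirroring the formula above."""
    return sum(w * kl_divergence(student, e) for w, e in zip(weights, experts))


student = [0.5, 0.5]
experts = [[0.5, 0.5], [0.25, 0.75]]  # the first expert already agrees with the student
weights = [0.5, 0.5]
loss = opd_loss(student, experts, weights)
print(f"{loss:.4f}")
```

The first expert contributes zero because its distribution matches the student's, so the whole loss comes from the second expert, scaled by its weight.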

Comparison Tables

(a) Token Efficiency (Visual Tokens)

Lower is better: fewer visual tokens are needed to process an 800x800 input image. For the DeepSeek model, the 361 visual tokens correspond to roughly 90 entries in the KV cache.

| Model | Visual tokens (approx.) |
| --- | --- |
| Gemma-4-31B | 289 |
| Ours-284B-A13B (DeepSeek) | 361 (~90 entries in KV cache) |
| Qwen3-VL-235B-A22B | 660 |
| GPT-5.4 | 740 |
| Claude-Sonnet-4.6 | 870 |
| Gemini-3-Flash | 1100 |

(b) Average Score on Selected Benchmarks

Higher is better, indicating better performance on a subset of evaluation dimensions (counting and spatial reasoning), excluding in-house benchmarks.

| Model | Average Score (%) |
| --- | --- |
| Gemma-4-31B | 69.7 |
| Ours-284B-A13B (DeepSeek) | 77.2 |
| Qwen3-VL-235B-A22B | 68.1 |
| GPT-5.4 | 71.1 |
| Claude-Sonnet-4.6 | 65.3 |
| Gemini-3-Flash | 76.5 |
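The two tables are easier to read side by side. The short script below copies the figures from (a) and (b) and expresses each model's token count relative to the DeepSeek model; the ratio is an illustrative convenience, not a metric reported in the paper.

```python
# Figures copied from tables (a) and (b) above.
visual_tokens = {
    "Gemma-4-31B": 289,
    "Ours-284B-A13B": 361,
    "Qwen3-VL-235B-A22B": 660,
    "GPT-5.4": 740,
    "Claude-Sonnet-4.6": 870,
    "Gemini-3-Flash": 1100,
}
avg_score = {
    "Gemma-4-31B": 69.7,
    "Ours-284B-A13B": 77.2,
    "Qwen3-VL-235B-A22B": 68.1,
    "GPT-5.4": 71.1,
    "Claude-Sonnet-4.6": 65.3,
    "Gemini-3-Flash": 76.5,
}
BASELINE = "Ours-284B-A13B"


def relative_budget(model: str) -> float:
    """Token count of `model` as a multiple of the baseline's token count."""
    return visual_tokens[model] / visual_tokens[BASELINE]


for model in sorted(visual_tokens, key=visual_tokens.get):
    print(f"{model:22s} {relative_budget(model):5.2f}x tokens  {avg_score[model]:5.1f}%")

best = max(avg_score, key=avg_score.get)
```

Read this way, only Gemma-4-31B uses fewer tokens than the DeepSeek model, and it scores lower, while the closest competitor on score, Gemini-3-Flash, uses about three times as many tokens.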

⚠️ Common Mistakes & Pitfalls

  1. Reliance on Explicit Trigger Words: The current 'thinking with visual primitives' capability does not automatically invoke visual reasoning. It relies on explicit trigger words in the prompt (e.g., "locate," "count," "trace") to activate this mechanism. Future work aims for autonomous invocation based on context.
  2. Limitations of Bounding Boxes for Fine-Grained Detail: While bounding boxes are effective for general object detection and spatial reasoning, they may not be ideal for tasks requiring extremely fine-grained analysis, such as counting individual blades of grass or strands of hair. More precise referential mechanisms might be needed for such scenarios.
  3. Generalization Challenges for Topological Reasoning: The topological reasoning capabilities, while impressive for structured tasks like mazes, might not generalize as robustly to completely novel or abstract visual structures that differ significantly from the training data. Further research is needed to enhance generalization.
  4. Misleading Benchmark Interpretations: It's crucial to critically evaluate benchmark results. Some models might perform exceptionally well on custom, in-house benchmarks that are specifically tailored to their strengths, potentially inflating their perceived capabilities. DeepSeek's paper explicitly excludes in-house benchmarks from its reported average scores, focusing on public benchmarks for a more objective comparison.
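Because of the first pitfall, it can help to check a prompt for a trigger word before sending it. The sketch below uses only the three example words named above; the full set the model responds to is not given in the source, and the helper `has_trigger` is hypothetical.

```python
import re

# Example trigger words from pitfall 1; the model's actual list is not published here.
TRIGGER_WORDS = ("locate", "count", "trace")


def has_trigger(prompt: str) -> bool:
    """Return True if the prompt contains one of the known trigger words."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return any(trigger in words for trigger in TRIGGER_WORDS)


print(has_trigger("Count the number of men in this image."))
print(has_trigger("How many men are in this image?"))
```

The match is on whole words only, so inflected forms such as "counting" are not detected; a stricter or looser rule may suit a given workflow better.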

Glossary

Visual Primitives: Fundamental spatial markers, such as points and bounding boxes, that AI systems use to directly reference and interact with specific physical coordinates or regions within an image during their reasoning process.

Perception Gap: The inherent challenge in multimodal AI where large language models struggle to effectively integrate and reason with high-resolution visual information, often overlooking subtle but crucial spatial layouts and relationships.

On-Policy Distillation: A machine learning training technique where a smaller, more efficient 'student' model learns by mimicking the decision-making policies of multiple larger, specialized 'expert' models, thereby consolidating diverse knowledge into a single, capable system.

Key Takeaways

  • DeepSeek's 'Thinking with Visual Primitives' introduces a paradigm shift from merely processing more pixels to developing more precise and less ambiguous referential mechanisms that bridge the gap between language and the visual world.
  • On the selected public counting and spatial-reasoning benchmarks, the framework averages 77.2%, the highest score in the comparison (Gemini-3-Flash is next at 76.5%).
  • By using visual primitives, DeepSeek achieves competitive performance with significantly lower image-token budgets, making it more efficient and cost-effective than many frontier models.
  • The ability to visually trace the AI's thought process (e.g., maze navigation) enhances interpretability, making it easier to understand how conclusions are reached and to debug errors.
  • The On-Policy Distillation training pipeline allows a single model to learn diverse visual reasoning skills from multiple expert models, promoting comprehensive intelligence.
  • This open research contributes to making advanced AI capabilities more accessible and affordable, fostering innovation beyond proprietary, resource-intensive models.
  • While powerful, the current implementation requires explicit trigger words for activation and may face generalization challenges with entirely new topological structures or extremely fine-grained visual tasks.

Resources

  • Paper: Thinking with Visual Primitives
  • DeepSeek AI Website: DeepSeek AI
  • Lambda GPU Cloud: lambda.ai/papers
  • GitHub Repository for Benchmarks (UC Berkeley): github.com/moogician/trustworthy-env
  • Ollama: ollama.com
  • Source: Daviet 2023 (link not provided in the video)
  • Source: Fei et al. 2021 (link not provided in the video)
  • Source: awnlhannun (link not provided in the video)