NVIDIA Nemotron 3 Super: Open-Source AI Model for Agile Reasoning
Explore NVIDIA's Nemotron 3 Super, an open-source AI model detailed in a 51-page research paper. Learn about its architecture, training data, and innovative techniques like NVFP4, Multi-Token Prediction, Mamba layers, and Stochastic Rounding for enhanced speed and accuracy.
Introduction
NVIDIA Nemotron 3 Super is an open-source, 120-billion parameter AI assistant designed for agile reasoning, offering performance comparable to leading closed-source models from a year and a half ago. It provides full transparency into its architecture, training methods, and dataset, enabling broader research and development in the AI community.
Configuration Checklist
| Element | Version / Link |
|---|---|
| Language / Runtime | Python (implied) |
| Main library | OpenCode (client application) |
| Required APIs | NVIDIA NIM (NVIDIA Inference Microservice) |
| Keys / credentials needed | apiKey: YOUR KEY HERE (for NVIDIA NIM) |
| Model | nvidia/nemotron-3-super-120b-a12b |
| Base URL | https://integrate.api.nvidia.com/v1 |
Step-by-Step Guide
Step 1 — Install OpenCode
To begin, you need to install the OpenCode client application. While the video does not provide the exact installation command, it is typically available via package managers or direct download from the official OpenCode repository.
# [Editor's note: the exact install command is not shown in the video; verify it in the official OpenCode documentation]
# Hypothetical examples:
# pip install opencode-ai
# or
# brew install opencode-ai  # for macOS
Step 2 — Configure OpenCode for Nemotron 3 Super
Configure OpenCode to use the Nemotron 3 Super model by editing its configuration file. This involves specifying the model ID, provider details, and your API key for NVIDIA NIM.
// opencode.json.demo
{
  "$schema": "https://opencode.ai/config.json",
  "model": "nvidia/nemotron-3-super-120b-a12b", // Specifies the Nemotron 3 Super model ID
  "provider": {
    "nvidia": {
      "npm": "@ai-sdk/openai-compatible", // Uses an OpenAI-compatible SDK for NVIDIA
      "name": "NVIDIA NIM", // Names the provider as NVIDIA NIM
      "options": {
        "baseURL": "https://integrate.api.nvidia.com/v1", // API endpoint for NVIDIA NIM
        "apiKey": "YOUR KEY HERE" // Placeholder for your NVIDIA API key
      }
    }
  },
  "models": {
    "nvidia/nemotron-3-super-120b-a12b": {
      "name": "Nemotron 3 Super" // Friendly name for the configured model
    }
  }
}
Step 3 — Interact with the AI Assistant
Once configured, you can interact with Nemotron 3 Super through the OpenCode interface by providing prompts. The AI will process your request and generate a response.
# Assuming OpenCode is running and configured
# You would type your prompt directly into the OpenCode terminal interface.
# Example prompt shown in video:
# In the MemoryItem class, add a function to update a specific memory item per id. add relevant documentation
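Because the provider configured above is OpenAI-compatible, the same endpoint can also be sanity-checked outside OpenCode with a short Python script. This is a hedged sketch, not something shown in the video: it assumes the `openai` Python package is installed and that your key is exported as a hypothetical `NVIDIA_API_KEY` environment variable; the base URL and model ID come from the Step 2 configuration.

```python
# Minimal sketch: query the NVIDIA NIM endpoint that OpenCode is configured
# to use, via the OpenAI-compatible chat-completions API.
import os

BASE_URL = "https://integrate.api.nvidia.com/v1"
MODEL_ID = "nvidia/nemotron-3-super-120b-a12b"

def build_chat_request(prompt: str, temperature: float = 0.2) -> dict:
    """Build the JSON payload for a chat-completions call."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask(prompt: str) -> str:
    """Send the prompt to NVIDIA NIM (requires network access and a valid key)."""
    from openai import OpenAI  # assumed dependency: pip install openai
    client = OpenAI(base_url=BASE_URL, api_key=os.environ["NVIDIA_API_KEY"])
    resp = client.chat.completions.create(**build_chat_request(prompt))
    return resp.choices[0].message.content

if __name__ == "__main__":
    payload = build_chat_request("In the MemoryItem class, add a function to "
                                 "update a specific memory item per id.")
    print(payload["model"])
```

The payload builder is separated from the network call so the request shape can be inspected (or logged) without spending API credits.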
Comparison Tables
Artificial Analysis Intelligence Index: Open Weights, <250B Parameters
| Model | Score |
|---|---|
| MiniMax-ML3.5 | 42 |
| Qwen2.5-122B-A10B | 42 |
| Mixtral-8x7B | 39 |
| NVIDIA Nemotron 3 Super | 36 |
| GLM-4-9B-Flash | 33 |
| Qwen2.5-32B-A10B | 30 |
| GLM-4-8B-Flash | 30 |
| Qwen2.5-72B-A10B | 27 |
| NVIDIA Nemotron 3 (high) | 24 |
| GLM-4-6B-Air | 23 |
| Nemotron-3-8B-Instruct | 19 |
Accuracy Comparison
| Benchmark | Nemotron-3-Super-120B-A12B-BF16 (%) | Nemotron-3-Super-120B-A12B-NVFP4 (%) | GPT-OSS-120B-A5B-MXFP4 (%) | Qwen3.5-122B-A10B-BF16 (%) |
|---|---|---|---|---|
| IFBench (Inst. Following) | 72.6 | 73.3 | 73.8 | 73.8 |
| HMMT Feb25 (Math) | 94.7 | 95.4 | 90.0 | 91.4 |
| SWE-Bench (Coding) | 60.5 | 60.5 | 41.9 | 66.4 |
| HLE (Science) | 22.8 (+tools) | 18.7 | 17.4 | 19.0 |
| Terminal-Bench Hard (Terminal Use) | 25.3 | 25.8 | 24.5 | 26.8 |
Throughput Comparison (Relative tokens/s/GPU)
| Model Version | ISL/OSL 8k/64k (input/output sequence length) |
|---|---|
| NVFP4 | 2.2 |
| BF16 | 0.6 |
| OSL (other model) | 1.0 |
| ISL (other model) | 0.3 |
Lambda GPU Cloud Pricing (Example Instances)
| GPU | vCPUs | RAM | Storage | Price/GPU/hr |
|---|---|---|---|---|
| NVIDIA GH200 | 64 | 432 GiB | 4 TiB SSD | $1.49 |
| NVIDIA H100 SXM | 80 | 225 GiB | 7.2 TiB SSD | $3.29 |
| NVIDIA H100 PCIe | 80 | 225 GiB | 7.2 TiB SSD | $2.49 |
| NVIDIA A100 SXM | 80 | 225 GiB | 1.3 TiB SSD | $1.99 |
| NVIDIA A100 PCIe | 80 | 225 GiB | 1.3 TiB SSD | $1.29 |
| NVIDIA A10 | 24 | 186 GiB | 1.3 TiB SSD | $0.75 |
| NVIDIA A6000 | 48 | 186 GiB | 1.3 TiB SSD | $0.79 |
| NVIDIA Quadro RTX 6000 | 24 | 186 GiB | 1.3 TiB SSD | $0.59 |
⚠️ Common Mistakes & Pitfalls
- Loss of Accuracy with Naive Quantization: Naively compressing a model's arithmetic by rounding every value to lower precision can cause a severe loss of accuracy, to the point of nonsensical output. The fix is smart quantization: only the less sensitive calculations are rounded, while critical ones keep full precision. (02:43)
- Slow Inference from Sequential Token Generation: Traditional AI models generate responses token by token (word by word), which is inefficient and slow for longer outputs. The solution is Multi-Token Prediction (MTP), allowing the model to calculate and verify several future tokens (e.g., 7 tokens) simultaneously, drastically speeding up response generation. (03:47)
- Inefficient Context Handling (Memory Problem): AI systems often re-read the entire input context for each new token, akin to a student constantly re-reading a textbook. This is memory-intensive and slow. Mamba layers address this by reading the input once and taking highly compressed notes, remembering important details while discarding filler words, leading to efficient processing of massive data. (04:20)
- Accumulation of Errors in Recurrent Models: In recurrent generation, small rounding errors from each step can accumulate and magnify over many steps, leading to a drift from the correct answer. Stochastic Rounding (SR) mitigates this by adding carefully crafted random noise that averages to zero over time, ensuring that while individual steps might vary, the overall output remains accurate. (05:10)
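The error-accumulation pitfall and its stochastic-rounding fix can be seen in a toy accumulator. This is a minimal illustration of the idea only, not NVIDIA's actual kernels: we add 0.3 a thousand times while storing the running sum on an integer grid. Round-to-nearest discards the increment at every step and never moves; stochastic rounding is noisy per step but unbiased, so it lands near the true total.

```python
# Toy illustration of stochastic rounding vs. round-to-nearest when a
# quantized accumulator is updated many times in a row.
import math
import random

def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x up with probability equal to its fractional part."""
    floor = math.floor(x)
    return floor + (1 if rng.random() < x - floor else 0)

def accumulate(n: int = 1000, step: float = 0.3):
    """Add `step` n times, keeping each running sum on an integer grid."""
    rng = random.Random(0)  # fixed seed for reproducibility
    acc_rtn, acc_sr = 0, 0
    for _ in range(n):
        acc_rtn = round(acc_rtn + step)               # round-to-nearest: stuck at 0
        acc_sr = stochastic_round(acc_sr + step, rng)  # unbiased on average
    return acc_rtn, acc_sr  # true sum is n * step = 300

if __name__ == "__main__":
    rtn, sr = accumulate()
    print(f"round-to-nearest: {rtn}, stochastic: {sr}, true: 300")
```

Round-to-nearest returns 0 because 0.3 always rounds back down, while the stochastic result is a binomial draw centered on 300: individual steps vary, but the errors average out instead of compounding.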
Glossary
NVFP4: A specialized numerical format (NVIDIA Floating Point 4) designed to accelerate AI inference by compressing mathematical operations, leading to significantly faster throughput with minimal accuracy loss.
Multi-Token Prediction (MTP): An optimization technique where an AI model predicts and verifies multiple future tokens (words or sub-word units) in parallel, rather than sequentially, to improve inference speed.
Mamba layers: A novel architecture component that improves memory efficiency in AI models by processing input sequences once and storing highly compressed, relevant information, avoiding redundant re-reading of context.
Stochastic Rounding (SR): A quantization technique that introduces controlled random noise during numerical rounding, preventing the systematic accumulation of errors in recurrent neural networks and maintaining overall accuracy over many computational steps.
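The draft-and-verify loop behind Multi-Token Prediction can also be sketched with toy stand-in models. Everything here is hypothetical (`target_next` and `draft_next` are simple integer functions, not the real architecture): a cheap draft guesses several tokens ahead, and the "big" model verifies them, accepting the longest correct prefix plus one corrected token, so the output always matches plain one-token-at-a-time decoding.

```python
# Toy sketch of speculative multi-token prediction (draft-and-verify),
# not the actual Nemotron MTP heads.
from typing import List

def target_next(token: int) -> int:
    """Stand-in for the big model's next-token rule."""
    return (3 * token + 1) % 101

def draft_next(token: int) -> int:
    """Stand-in for the cheap draft model: right most of the time."""
    return target_next(token) if token % 5 != 0 else (token + 7) % 101

def speculative_decode(start: int, n_tokens: int, k: int = 4) -> List[int]:
    out = [start]
    while len(out) <= n_tokens:
        # 1) Draft k tokens ahead from the current tail.
        drafts, t = [], out[-1]
        for _ in range(k):
            t = draft_next(t)
            drafts.append(t)
        # 2) Verify: accept drafts while they match the target model,
        #    then emit one corrected token on the first mismatch.
        t = out[-1]
        for d in drafts:
            expected = target_next(t)
            if d == expected:
                out.append(d)
            else:
                out.append(expected)  # target's correction
                t = expected
                break
            t = d
    return out[: n_tokens + 1]

def greedy_decode(start: int, n_tokens: int) -> List[int]:
    """Baseline: one token per step from the target model."""
    out = [start]
    for _ in range(n_tokens):
        out.append(target_next(out[-1]))
    return out
```

Each loop iteration emits between one and k tokens, which is where the speedup comes from: in a real system the verification of all k drafts happens in a single forward pass of the large model, while correctness is preserved because every emitted token is checked against it.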
Key Takeaways
- NVIDIA Nemotron 3 Super is a 120-billion parameter open-source AI model, offering transparency and accessibility for AI development.
- It achieves performance comparable to closed-source frontier models from a year and a half ago, making advanced AI more widely available.
- Innovative techniques like NVFP4 quantization enable up to 7x faster inference without significant accuracy degradation.
- Multi-Token Prediction (MTP) significantly boosts generation speed by predicting and verifying multiple tokens concurrently.
- Mamba layers enhance memory efficiency by intelligently compressing and retaining context, allowing for efficient processing of large datasets.
- Stochastic Rounding addresses error accumulation in recurrent models, ensuring long-term accuracy despite numerical compression.
- NVIDIA is reportedly investing billions into open-weight AI models, signaling a shift towards more open and collaborative AI development.
- The model can be deployed and experimented with on powerful GPU cloud platforms like Lambda GPU Cloud.
Resources
- NVIDIA Nemotron 3 Super Research Paper: [Editor's note: no direct link is provided in the video; the paper title shown is "NVIDIA-Nemotron-3-8B-Instruct: Efficient Micro-Mixture-of-Experts Hybrid Mixture-of-Experts Model for Agile Reasoning"]. Search for the full title on arXiv or NVIDIA's research page.
- Lambda GPU Cloud: lambda.ai/papers (for powerful NVIDIA GPU instances to run your own chatbots and experiments)
- OpenCode AI: [Editor's note: Official website or GitHub repo for OpenCode AI is not explicitly linked in the video, search for "OpenCode AI" to find it.]