NVIDIA Nemotron 3 Super: Open-Source AI Model for Agile Reasoning
Explore NVIDIA's Nemotron 3 Super, an open-source AI model detailed in a 51-page research paper. Learn about its architecture, training data, and innovative techniques like NVFP4, Multi-Token Prediction, Mamba layers, and Stochastic Rounding for enhanced speed and accuracy.
Introduction
NVIDIA Nemotron 3 Super is an open-source, 120-billion parameter AI assistant designed for agile reasoning, offering performance comparable to leading closed-source models from a year and a half ago. It provides full transparency into its architecture, training methods, and dataset, enabling broader research and development in the AI community.
Configuration Checklist
| Element | Version / Link |
|---|---|
| Language / Runtime | Python (implied) |
| Main library | OpenCode (client application) |
| Required APIs | NVIDIA NIM (NVIDIA Inference Microservice) |
| Keys / credentials needed | apiKey: YOUR KEY HERE (for NVIDIA NIM) |
| Model | nvidia/nemotron-3-super-120b-a12b |
| Base URL | https://integrate.api.nvidia.com/v1 |
Step-by-Step Guide
Step 1 — Install OpenCode
To begin, you need to install the OpenCode client application. While the video does not provide the exact installation command, it is typically available via package managers or direct download from the official OpenCode repository.
# [Editor's note: the exact install command is not shown in the video; verify it in the official OpenCode documentation]
# Hypothetical examples:
# pip install opencode-ai
# or
# brew install opencode-ai  # for macOS
Step 2 — Configure OpenCode for Nemotron 3 Super
Configure OpenCode to use the Nemotron 3 Super model by editing its configuration file. This involves specifying the model ID, provider details, and your API key for NVIDIA NIM.
// opencode.json.demo
{
  "$schema": "https://opencode.ai/config.json",
  "model": "nvidia/nemotron-3-super-120b-a12b", // Specifies the Nemotron 3 Super model ID
  "provider": {
    "nvidia": {
      "npm": "@ai-sdk/openai-compatible", // Uses an OpenAI-compatible SDK for NVIDIA
      "name": "NVIDIA NIM", // Names the provider as NVIDIA NIM
      "options": {
        "baseURL": "https://integrate.api.nvidia.com/v1", // API endpoint for NVIDIA NIM
        "apiKey": "YOUR KEY HERE" // Placeholder for your NVIDIA API key
      }
    }
  },
  "models": {
    "nvidia/nemotron-3-super-120b-a12b": {
      "name": "Nemotron 3 Super" // Friendly name for the configured model
    }
  }
}
Step 3 — Interact with the AI Assistant
Once configured, you can interact with Nemotron 3 Super through the OpenCode interface by providing prompts. The AI will process your request and generate a response.
# Assuming OpenCode is running and configured
# You would type your prompt directly into the OpenCode terminal interface.
# Example prompt shown in video:
# In the MemoryItem class, add a function to update a specific memory item per id. add relevant documentation
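Because the provider configured above is OpenAI-compatible, the same endpoint can also be sanity-checked outside OpenCode with a short Python script. This is a hedged sketch, not something shown in the video: it assumes the `openai` Python package is installed and that your key is exported as a hypothetical `NVIDIA_API_KEY` environment variable; the base URL and model ID come from the Step 2 configuration.

```python
# Minimal sketch: query the NVIDIA NIM endpoint that OpenCode is configured
# to use, via the OpenAI-compatible chat-completions API.
import os

BASE_URL = "https://integrate.api.nvidia.com/v1"
MODEL_ID = "nvidia/nemotron-3-super-120b-a12b"

def build_chat_request(prompt: str, temperature: float = 0.2) -> dict:
    """Build the JSON payload for a chat-completions call."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def ask(prompt: str) -> str:
    """Send the prompt to NVIDIA NIM (requires network access and a valid key)."""
    from openai import OpenAI  # assumed dependency: pip install openai
    client = OpenAI(base_url=BASE_URL, api_key=os.environ["NVIDIA_API_KEY"])
    resp = client.chat.completions.create(**build_chat_request(prompt))
    return resp.choices[0].message.content

if __name__ == "__main__":
    payload = build_chat_request("In the MemoryItem class, add a function to "
                                 "update a specific memory item per id.")
    print(payload["model"])
```

The payload builder is separated from the network call so the request shape can be inspected (or logged) without spending API credits.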
Comparison Tables
Artificial Analysis Intelligence Index: Open Weights, <250B Parameters
| Model | Score |
|---|---|
| MiniMax-ML3.5 | 42 |
| Qwen2.5-122B-A10B | 42 |
| Mixtral-8x7B | 39 |
| NVIDIA Nemotron 3 Super | 36 |
| GLM-4-9B-Flash | 33 |
| Qwen2.5-32B-A10B | 30 |
| GLM-4-8B-Flash | 30 |
| Qwen2.5-72B-A10B | 27 |
| NVIDIA Nemotron 3 (high) | 24 |
| GLM-4-6B-Air | 23 |
| Nemotron-3-8B-Instruct | 19 |
Accuracy Comparison
| Benchmark | Nemotron-3-Super-120B-A12B-BF16 (%) | Nemotron-3-Super-120B-A12B-NVFP4 (%) | GPT-OSS-120B-A5B-MXFP4 (%) | Qwen3.5-122B-A10B-BF16 (%) |
|---|---|---|---|---|
| IFBench (Inst. Following) | 72.6 | 73.3 | 73.8 | 73.8 |
| HMMT Feb25 (Math) | 94.7 | 95.4 | 90.0 | 91.4 |
| SWE-Bench (Coding) | 60.5 | 60.5 | 41.9 | 66.4 |
| HLE (Science) | 22.8 (+tools) | 18.7 | 17.4 | 19.0 |
| Terminal-Bench Hard (Terminal Use) | 25.3 | 25.8 | 24.5 | 26.8 |
Throughput Comparison (Relative tokens/s/GPU)
| Model Version | ISL/OSL 8k/64k (input/output sequence length) |
|---|---|
| NVFP4 | 2.2 |
| BF16 | 0.6 |
| OSL (other model) | 1.0 |
| ISL (other model) | 0.3 |
Lambda GPU Cloud Pricing (Example Instances)
| GPU | vCPUs | RAM | Storage | Price/GPU/hr |
|---|---|---|---|---|
| NVIDIA GH200 | 64 | 432 GiB | 4 TiB SSD | $1.49 |
| NVIDIA H100 SXM | 80 | 225 GiB | 7.2 TiB SSD | $3.29 |
| NVIDIA H100 PCIe | 80 | 225 GiB | 7.2 TiB SSD | $2.49 |
| NVIDIA A100 SXM | 80 | 225 GiB | 1.3 TiB SSD | $1.99 |
| NVIDIA A100 PCIe | 80 | 225 GiB | 1.3 TiB SSD | $1.29 |
| NVIDIA A10 | 24 | 186 GiB | 1.3 TiB SSD | $0.75 |
| NVIDIA A6000 | 48 | 186 GiB | 1.3 TiB SSD | $0.79 |
| NVIDIA Quadro RTX 6000 | 24 | 186 GiB | 1.3 TiB SSD | $0.59 |
⚠️ Common Mistakes & Pitfalls
- Loss of Accuracy with Naive Quantization: Naively compressing a model's arithmetic by rounding every value to lower precision can cause a severe loss of accuracy, to the point of nonsensical output. The fix is smart quantization: only the less sensitive calculations are rounded, while critical ones keep full precision. (02:43)
- Slow Inference from Sequential Token Generation: Traditional AI models generate responses token by token (word by word), which is inefficient and slow for longer outputs. The solution is Multi-Token Prediction (MTP), allowing the model to calculate and verify several future tokens (e.g., 7 tokens) simultaneously, drastically speeding up response generation. (03:47)
- Inefficient Context Handling (Memory Problem): AI systems often re-read the entire input context for each new token, akin to a student constantly re-reading a textbook. This is memory-intensive and slow. Mamba layers address this by reading the input once and taking highly compressed notes, remembering important details while discarding filler words, leading to efficient processing of massive data. (04:20)
- Accumulation of Errors in Recurrent Models: In recurrent generation, small rounding errors from each step can accumulate and magnify over many steps, leading to a drift from the correct answer. Stochastic Rounding (SR) mitigates this by adding carefully crafted random noise that averages to zero over time, ensuring that while individual steps might vary, the overall output remains accurate. (05:10)
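The error-accumulation pitfall and its stochastic-rounding fix can be seen in a toy accumulator. This is a minimal illustration of the idea only, not NVIDIA's actual kernels: we add 0.3 a thousand times while storing the running sum on an integer grid. Round-to-nearest discards the increment at every step and never moves; stochastic rounding is noisy per step but unbiased, so it lands near the true total.

```python
# Toy illustration of stochastic rounding vs. round-to-nearest when a
# quantized accumulator is updated many times in a row.
import math
import random

def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x up with probability equal to its fractional part."""
    floor = math.floor(x)
    return floor + (1 if rng.random() < x - floor else 0)

def accumulate(n: int = 1000, step: float = 0.3):
    """Add `step` n times, keeping each running sum on an integer grid."""
    rng = random.Random(0)  # fixed seed for reproducibility
    acc_rtn, acc_sr = 0, 0
    for _ in range(n):
        acc_rtn = round(acc_rtn + step)               # round-to-nearest: stuck at 0
        acc_sr = stochastic_round(acc_sr + step, rng)  # unbiased on average
    return acc_rtn, acc_sr  # true sum is n * step = 300

if __name__ == "__main__":
    rtn, sr = accumulate()
    print(f"round-to-nearest: {rtn}, stochastic: {sr}, true: 300")
```

Round-to-nearest returns 0 because 0.3 always rounds back down, while the stochastic result is a binomial draw centered on 300: individual steps vary, but the errors average out instead of compounding.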
Glossary
NVFP4: A specialized numerical format (NVIDIA Floating Point 4) designed to accelerate AI inference by compressing mathematical operations, leading to significantly faster throughput with minimal accuracy loss.
Multi-Token Prediction (MTP): An optimization technique where an AI model predicts and verifies multiple future tokens (words or sub-word units) in parallel, rather than sequentially, to improve inference speed.
Mamba layers: A novel architecture component that improves memory efficiency in AI models by processing input sequences once and storing highly compressed, relevant information, avoiding redundant re-reading of context.
Stochastic Rounding (SR): A quantization technique that introduces controlled random noise during numerical rounding, preventing the systematic accumulation of errors in recurrent neural networks and maintaining overall accuracy over many computational steps.
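The draft-and-verify loop behind Multi-Token Prediction can also be sketched with toy stand-in models. Everything here is hypothetical (`target_next` and `draft_next` are simple integer functions, not the real architecture): a cheap draft guesses several tokens ahead, and the "big" model verifies them, accepting the longest correct prefix plus one corrected token, so the output always matches plain one-token-at-a-time decoding.

```python
# Toy sketch of speculative multi-token prediction (draft-and-verify),
# not the actual Nemotron MTP heads.
from typing import List

def target_next(token: int) -> int:
    """Stand-in for the big model's next-token rule."""
    return (3 * token + 1) % 101

def draft_next(token: int) -> int:
    """Stand-in for the cheap draft model: right most of the time."""
    return target_next(token) if token % 5 != 0 else (token + 7) % 101

def speculative_decode(start: int, n_tokens: int, k: int = 4) -> List[int]:
    out = [start]
    while len(out) <= n_tokens:
        # 1) Draft k tokens ahead from the current tail.
        drafts, t = [], out[-1]
        for _ in range(k):
            t = draft_next(t)
            drafts.append(t)
        # 2) Verify: accept drafts while they match the target model,
        #    then emit one corrected token on the first mismatch.
        t = out[-1]
        for d in drafts:
            expected = target_next(t)
            if d == expected:
                out.append(d)
            else:
                out.append(expected)  # target's correction
                t = expected
                break
            t = d
    return out[: n_tokens + 1]

def greedy_decode(start: int, n_tokens: int) -> List[int]:
    """Baseline: one token per step from the target model."""
    out = [start]
    for _ in range(n_tokens):
        out.append(target_next(out[-1]))
    return out
```

Each loop iteration emits between one and k tokens, which is where the speedup comes from: in a real system the verification of all k drafts happens in a single forward pass of the large model, while correctness is preserved because every emitted token is checked against it.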
Key Takeaways
- NVIDIA Nemotron 3 Super is a 120-billion parameter open-source AI model, offering transparency and accessibility for AI development.
- It achieves performance comparable to closed-source frontier models from a year and a half ago, making advanced AI more widely available.
- Innovative techniques like NVFP4 quantization enable up to 7x faster inference without significant accuracy degradation.
- Multi-Token Prediction (MTP) significantly boosts generation speed by predicting and verifying multiple tokens concurrently.
- Mamba layers enhance memory efficiency by intelligently compressing and retaining context, allowing for efficient processing of large datasets.
- Stochastic Rounding addresses error accumulation in recurrent models, ensuring long-term accuracy despite numerical compression.
- NVIDIA is reportedly investing billions into open-weight AI models, signaling a shift towards more open and collaborative AI development.
- The model can be deployed and experimented with on powerful GPU cloud platforms like Lambda GPU Cloud.
Resources
- NVIDIA Nemotron 3 Super Research Paper: [Editor's note: no direct link is provided in the video; the paper title shown is "NVIDIA-Nemotron-3-8B-Instruct: Efficient Micro-Mixture-of-Experts Hybrid Mixture-of-Experts Model for Agile Reasoning"]. Search for the full title on arXiv or NVIDIA's research page.
- Lambda GPU Cloud: lambda.ai/papers (for powerful NVIDIA GPU instances to run your own chatbots and experiments)
- OpenCode AI: [Editor's note: Official website or GitHub repo for OpenCode AI is not explicitly linked in the video, search for "OpenCode AI" to find it.]