At Computex 2026 in Taipei, NVIDIA CEO Jensen Huang introduced the NVIDIA Nemotron-3 Ultra. It is a Mixture-of-Experts model with 550 billion parameters. It is large, it is open, and it exists to run multi-step AI reasoning workflows without causing your servers to melt.
The Scale Problem in Agentic AI
Running autonomous AI agents is a good way to exhaust your compute budget. Each step in a multi-step task is another model call, and each call usually carries the growing history of the steps before it. Every extra step added to a workflow therefore increases the probability that your system will either run out of memory or take a very long time to return a response.
Self-attention in a standard Transformer also scales quadratically with context length: doubling the input roughly quadruples the attention compute. If you feed an agent a large codebase, a long technical manual, or a complicated database schema, the cost of attention grows much faster than the size of the input. Historically, developers had to choose between models that are smart and models that are fast.
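To make the quadratic growth concrete, here is a minimal back-of-the-envelope sketch. It counts only the two matrix products inside one self-attention layer (the query-key scores and the weighted sum over values) and ignores everything else in the model. The function names, the `d_model` default of 4096, and the context sizes are illustrative assumptions, not figures from NVIDIA.

```python
def attention_flops(n_tokens: int, d_model: int = 4096) -> int:
    """Approximate FLOPs for the two n x n matrix products in one attention layer."""
    # Q @ K^T produces an n x n score matrix: n * n * d multiply-adds (2 FLOPs each).
    score_flops = 2 * n_tokens * n_tokens * d_model
    # scores @ V consumes that n x n matrix again: another n * n * d multiply-adds.
    value_flops = 2 * n_tokens * n_tokens * d_model
    return score_flops + value_flops


def growth_factor(short_ctx: int, long_ctx: int, d_model: int = 4096) -> float:
    """How many times more attention compute the longer context needs."""
    return attention_flops(long_ctx, d_model) / attention_flops(short_ctx, d_model)


if __name__ == "__main__":
    # Hypothetical context sizes: a short prompt, a large file, a whole codebase.
    for n in (8_000, 64_000, 1_000_000):
        print(f"{n:>9} tokens: {growth_factor(8_000, n):,.0f}x the 8k-token cost")
```

Under these assumptions, an input that is 8 times longer (8,000 to 64,000 tokens) needs 64 times the attention compute, which is why long-context agent workloads get expensive so quickly.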
Nemotron-3 Ultra is NVIDIA's answer to that trade-off. The official weights will be released on June 4, 2026, on Hugging Face, ModelScope, and OpenRouter, giving developers direct, open access to a large reasoning engine.
Introducing NVIDIA Nemotron-3 Ultra
The NVIDIA Nemotron-3 Ultra is the largest member of the Nemotron-3 family. The family also contains the edge-optimized Nano model and the mid-range Super model.
The Ultra variant contains approximately 550 billion total parameters. To prevent the model from requiring the electrical output of a small power plant, it uses a Mixture-of-Experts (MoE) routing gate. This gate ensures that only about 55 billion parameters, roughly 10% of the total, are active for any given token. NVIDIA claims that this architecture offers up to 5x higher throughput and up to 30% lower operational costs.
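The routing idea can be sketched in a few lines. What follows is a generic top-k MoE gate, not NVIDIA's implementation: the gate scores every expert, keeps the `k` best, and mixes only their outputs, so the remaining experts cost nothing for that token. The toy scalar "experts", the logits, and all function names are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, mixing_weight) for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Renormalize over the chosen experts only, so the weights sum to 1.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x: float, experts: Sequence[Callable[[float], float]],
                router_logits: Sequence[float], k: int) -> float:
    """Run only the selected experts and blend their outputs."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the model's parameters touched per token."""
    return active_params / total_params


if __name__ == "__main__":
    # Four toy experts; a real model would use feed-forward networks here.
    experts = [lambda v: v + 1, lambda v: v * 2, lambda v: -v, lambda v: v * v]
    logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical gate scores for one token
    print(route_top_k(logits, k=2))
    print(moe_forward(3.0, experts, logits, k=2))
    print(f"{active_fraction(55e9, 550e9):.0%} of parameters active per token")
```

The last line uses the article's own figures: 55 billion active out of 550 billion total. In a real model the active set also includes components that every token uses, such as attention layers and embeddings, so the fraction of experts selected is not necessarily the same 10%.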
Architectural Breakthroughs: LatentMoE and Multi-Token Prediction
NVIDIA modified the standard Transformer structure with three main architectural adjustments to improve performance:
- LatentMoE (Latent Mixture of Experts) — compressing tokens into a low-rank latent space before routing allows the model to access four times as many expert specialists without increasing the memory bandwidth requirement.
- Multi-Token Prediction (MTP) — predicting multiple future tokens simultaneously improves the coherence of reasoning and enables built-in speculative decoding.
- Mamba-Transformer Hybrid Structure — combining traditional attention layers with Mamba-2 state-space layers allows the model to process a 1-million-token context window in linear time.
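The link between multi-token prediction and speculative decoding is easier to see in code. The sketch below shows generic greedy speculative decoding and is not taken from Nemotron's implementation: a cheap drafter proposes several tokens (here standing in for the MTP heads), and the full model keeps the longest run it agrees with, then contributes one token of its own. `speculative_step`, `target_next`, and the toy counting model are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence

# A "model" here is any function mapping a token sequence to its next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: List[int],
                     target_next: NextToken) -> List[int]:
    """Accept the longest draft prefix the target agrees with, plus one target token.

    For clarity the target is called once per position. A real system would
    score every draft position in a single batched forward pass.
    """
    accepted: List[int] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First disagreement: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target adds one more token for free.
    accepted.append(target_next(prefix + accepted))
    return accepted


if __name__ == "__main__":
    # Toy target model: always predicts (last token + 1) modulo 10.
    def target(seq: Sequence[int]) -> int:
        return (seq[-1] + 1) % 10

    print(speculative_step([1, 2], [3, 4, 9], target))  # draft is wrong at the third token
    print(speculative_step([1, 2], [3, 4], target))     # draft is fully correct
```

With greedy verification the accepted tokens are exactly what the target model would have produced on its own, so the speedup comes from emitting several tokens per verification step rather than from any change in output.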
Optimized for the Blackwell GPU Architecture
The model was co-designed to run on NVIDIA Blackwell GPUs. It uses the new NVFP4 (4-bit floating point) precision format.
Storing weights in 4 bits instead of 16 cuts weight memory to roughly a quarter. That allows datacenters to run a 550-billion-parameter model on fewer GPUs than 16-bit weights would require. This hardware integration is useful for organizations that prefer to spend less money on hardware procurement.
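A rough calculation shows the size of the saving. This sketch counts weight storage only and ignores the KV cache, activations, and runtime overhead. It assumes that NVFP4 stores one 8-bit scale per block of 16 values, which adds about half a bit per weight, and it uses a hypothetical 192 GB of memory per GPU. Both numbers are parameters you should replace with figures from your own hardware and NVIDIA's documentation.

```python
import math

TOTAL_PARAMS = 550e9  # total parameter count quoted in the article


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Gigabytes (10^9 bytes) needed to store the weights alone."""
    return n_params * bits_per_param / 8 / 1e9


def blockwise_bits(value_bits: int = 4, block_size: int = 16,
                   scale_bits: int = 8) -> float:
    """Effective bits per weight when each block shares one scale factor.

    The defaults are an assumption about NVFP4's layout, not a quoted spec.
    """
    return value_bits + scale_bits / block_size


def gpus_needed(memory_gb: float, gpu_memory_gb: float = 192.0) -> int:
    """Minimum GPU count to hold the weights; 192 GB is a hypothetical capacity."""
    return math.ceil(memory_gb / gpu_memory_gb)


if __name__ == "__main__":
    for label, bits in (("16-bit", 16.0), ("4-bit + scales", blockwise_bits())):
        mem = weight_memory_gb(TOTAL_PARAMS, bits)
        print(f"{label:>15}: {mem:8.1f} GB of weights, at least {gpus_needed(mem)} GPUs")
```

Under these assumptions the 16-bit weights need about 1,100 GB while the 4-bit version needs about 309 GB, which is the difference between six GPUs and two before any working memory is counted.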
Enterprise Implementation
Some organizations are already testing Nemotron-3 Ultra for enterprise applications:
- CrowdStrike — testing the model to run long-horizon security analysis and automated threat-detection agents.
- Palantir — deploying the model inside Palantir AIP to coordinate multi-system operational decisions.