Llama-3.1-Nemotron-Ultra-253B-v1

Tags: llama, Reasoning, Tool Calling, Open Weights, Structured Output

Llama-3.1-Nemotron-Ultra-253B-v1 is a large language model (LLM) optimized for advanced reasoning, human-interactive chat, retrieval-augmented generation (RAG), and tool-calling tasks. Derived from Meta’s Llama-3.1-405B-Instruct, it has been significantly customized using Neural Architecture Search (NAS), resulting in enhanced efficiency, reduced memory usage, and improved inference latency. The model supports a context length of up to 128K tokens and can operate efficiently on an 8x NVIDIA H100 node. Note: you must include `detailed thinking on` in the system prompt to enable reasoning. Please see [Usage Recommendations](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1#quick-start-and-usage-recommendations) for more.
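As the note above says, reasoning is toggled through the system prompt rather than an API flag. A minimal sketch of building an OpenAI-compatible chat-completions payload with reasoning enabled (the helper function and user message are illustrative; the model ID follows the Nvidia row in the provider table below):

```python
# Sketch: an OpenAI-style chat request body for this model.
# "detailed thinking on" in the system prompt is the documented reasoning
# toggle; "detailed thinking off" is assumed to disable it per the linked
# usage recommendations.

def build_request(user_message: str, thinking: bool = True) -> dict:
    """Return a chat-completions payload with reasoning toggled on or off."""
    system_prompt = "detailed thinking on" if thinking else "detailed thinking off"
    return {
        "model": "nvidia/llama-3.1-nemotron-ultra-253b-v1",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_request("Why is the sky blue?")
```

The payload can then be POSTed to any provider's chat-completions endpoint with the usual client of your choice.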

Providers: 3
Released: Jul 1, 2024
Input Modalities: text
Output Modalities: text
Task Use: coding

Available Providers (3)

Provider              Model ID                                  Input Cost   Output Cost  Context  Max Output
Nvidia                nvidia/llama-3.1-nemotron-ultra-253b-v1   $0.00/MTok   $0.00/MTok   131.1K   8.2K
Vultr                 Llama-3_1-Nemotron-Ultra-253B-v1          $0.55/MTok   $1.80/MTok   32K      4.1K
Nebius Token Factory  nvidia/Llama-3_1-Nemotron-Ultra-253B-v1   $0.60/MTok   $1.80/MTok   128K     4.1K
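Prices in the table are per million tokens (MTok), so the cost of a single request is a straightforward per-token product. A small sketch, using the Vultr row's figures and hypothetical token counts:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_mtok: float, output_per_mtok: float) -> float:
    """Cost in dollars for one request, given $/MTok pricing."""
    return (input_tokens * input_per_mtok
            + output_tokens * output_per_mtok) / 1_000_000

# Vultr pricing from the table: $0.55/MTok input, $1.80/MTok output.
# 10,000 input tokens and 2,000 output tokens:
cost = request_cost(10_000, 2_000, 0.55, 1.80)
# 10_000 * 0.55/1e6 + 2_000 * 1.80/1e6 = 0.0055 + 0.0036 = 0.0091
```

The same function applies to any row; only the two per-MTok rates change.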

Capabilities

Reasoning
Tool Calling
Attachments
Open Weights
Structured Output
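For the Structured Output capability, many OpenAI-compatible providers accept a `response_format` field carrying a JSON Schema. Whether each provider listed above supports this exact request shape is an assumption, not something stated on this page; the schema and question below are illustrative:

```python
# Sketch: a structured-output request body in the common OpenAI-compatible
# "json_schema" shape. Per-provider support for this field is an assumption.

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["answer", "confidence"],
}

payload = {
    "model": "nvidia/llama-3.1-nemotron-ultra-253b-v1",
    "messages": [
        {"role": "system", "content": "detailed thinking off"},
        {"role": "user", "content": "Is 97 prime?"},
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "answer", "schema": schema},
    },
}
```

With a payload like this, the model is constrained to emit JSON matching the schema instead of free-form text.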