MiMo V2 Omni
MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capabilities (visual grounding, multi-step planning, tool use, and code execution), making it well suited to complex real-world tasks that span modalities. It has a 256K context window.
Available Providers (3)
| Provider | Model ID | Input Cost | Output Cost | Context | Max Output | Docs |
|---|---|---|---|---|---|---|
| | xiaomi/mimo-v2-omni | $0.40/MTok | $2.00/MTok | 265K | 265K | |
| | mimo-v2-omni | $0.40/MTok | $2.00/MTok | 256K | 128K | |
| | xiaomi/mimo-v2-omni | $0.40/MTok | $2.00/MTok | 262.1K | 65.5K | |
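As a rough illustration of how the per-token prices in the table translate into a per-request cost, here is a minimal sketch. The helper name `estimate_cost_usd` and the token counts are hypothetical. Only the $0.40/MTok input and $2.00/MTok output prices come from the table, and actual provider billing may differ (for example, for image, video, or audio inputs).

```python
# Per-million-token prices taken from the provider table.
INPUT_USD_PER_MTOK = 0.40
OUTPUT_USD_PER_MTOK = 2.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD (hypothetical helper)."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Illustrative request: 100,000 input tokens and 2,000 output tokens.
cost = estimate_cost_usd(100_000, 2_000)
print(f"Estimated cost: ${cost:.4f}")
```

Because output tokens cost five times as much as input tokens, long generations tend to dominate the bill even when the prompt is large.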
Capabilities
Reasoning
Tool Calling
Attachments
Open Weights
Structured Output