All Models

Qwen/Qwen3-VL-32B-Instruct

qwen Tool Calling Attachments Structured Output

Qwen3-VL-32B-Instruct is a large-scale multimodal vision-language model designed for high-precision understanding and reasoning across text, images, and video. With 32 billion parameters, it combines deep visual perception with advanced text comprehension, enabling fine-grained spatial reasoning, document and scene analysis, and long-horizon video understanding.Robust OCR in 32 languages, and enhanced multimodal fusion through Interleaved-MRoPE and DeepStack architectures. Optimized for agentic interaction and visual tool use, Qwen3-VL-32B delivers state-of-the-art performance for complex real-world multimodal tasks.

Providers 3
Released Oct 21, 2025
Input Modalities text, image
Output Modalities text
Tarsk Use coding

Available Providers (3)

Provider Model ID Input Cost Output Cost Context Max Output Docs
Kilo Gateway qwen/qwen3-vl-32b-instruct $0.10/MTok $0.42/MTok 131.1K 32.8K
SiliconFlow Qwen/Qwen3-VL-32B-Instruct $0.20/MTok $0.60/MTok 262K 262K
SiliconFlow (China) Qwen/Qwen3-VL-32B-Instruct $0.20/MTok $0.60/MTok 262K 262K

Capabilities

Reasoning
Tool Calling
Attachments
Open Weights
Structured Output