All Models

qwen/qwen3-vl-8b-instruct

Tool Calling Attachments Open Weights Structured Output

Qwen3-VL-8B-Instruct is a multimodal vision-language model from the Qwen3-VL series, built for high-fidelity understanding and reasoning across text, images, and video. It features improved multimodal fusion with Interleaved-MRoPE for long-horizon temporal reasoning, DeepStack for fine-grained visual-text alignment, and text-timestamp alignment for precise event localization. The model supports a native 256K-token context window, extensible to 1M tokens, and handles both static and dynamic media inputs for tasks like document parsing, visual question answering, spatial reasoning, and GUI control. It achieves text understanding comparable to leading LLMs while expanding OCR coverage to 32 languages and enhancing robustness under varied visual conditions.

Providers 4
Released Oct 15, 2025
Input Modalities text, image, video
Output Modalities text
Tarsk Use coding

Available Providers (4)

Provider Model ID Input Cost Output Cost Context Max Output Docs
NovitaAI qwen/qwen3-vl-8b-instruct $0.08/MTok $0.50/MTok 131.1K 32.8K
Kilo Gateway qwen/qwen3-vl-8b-instruct $0.08/MTok $0.50/MTok 131.1K 32.8K
SiliconFlow Qwen/Qwen3-VL-8B-Instruct $0.18/MTok $0.68/MTok 262K 262K
SiliconFlow (China) Qwen/Qwen3-VL-8B-Instruct $0.18/MTok $0.68/MTok 262K 262K

Capabilities

Reasoning
Tool Calling
Attachments
Open Weights
Structured Output