Capability Overview
The Vision capability supports the model in understanding both images and text simultaneously, allowing for analysis, description, judgment, and question-answering based on image content. Developers can send one or more images to the model in a single request, along with natural language instructions, to complete multimodal understanding tasks. Typical capabilities include:- Image content description (objects, scenes, actions)
- Image question answering (asking questions about the image)
- Comparative analysis and synthesis of multiple images
- Joint reasoning with images + text
Quick Start
Supported Input Formats
Images can be provided to the model in two main ways: by passing the image link or by directly including a base64-encoded image in the request. Images can be included inuser, system, and assistant messages. Currently, images are not supported in the first system message.
Image URL Input (Recommended)
Directly pass an image URL accessible from the public internet, suitable for online business scenarios.Base64 Encoded Image Input
Suitable for local files or private image scenarios. Process Description:- Read the image file locally.
- Convert it to a base64 string.
- Pass it as image content in the request.
Message Structure Example
Images are typically sent alongside text instructions to clarify the model’s understanding objectives.Multiple Image Input
Multiple images can be submitted in a single request, allowing the model to integrate understanding from all images.Image Clarity Control (detail Parameter)
Thedetail parameter can be used to control the level of detail the model applies when processing images:
| Parameter Value | Description |
|---|---|
low | Low resolution, fast speed, low token consumption |
high | High resolution, richer details, high token consumption |
auto | Automatically selects (default) |
- Content understanding / scene judgment:
autoorlow - When detail observation is needed (text, specific parts):
high
Billing and Token Explanation
Visual input will consume additional tokens, which should be considered in cost assessments:lowmode: Each image consumes a fixed 85 tokenshighmode: Token consumption increases based on image size and resolution
- Default to using
auto - Avoid unnecessary
highin bulk or high-concurrency scenarios
Usage Recommendations
- Always provide clear text instructions; do not send images alone.
- Control the number and resolution of images to avoid unnecessary costs.
- Conduct secondary validation for critical business outcomes.
- Use visual understanding as a supplementary capability, not the sole basis for judgment.