# Vision & Documents

Send images, documents, and videos alongside text for multimodal understanding and analysis.
## Overview

Multimodal-capable models can analyze images, documents, and videos alongside text. Yunxin supports multimodal inputs through the Chat Completions API using the `content` array.
## Content Types

| Type | Description | Supported Formats |
|---|---|---|
| `image_url` | Image analysis | JPEG, PNG, GIF, WebP (URL or base64) |
| `file_url` | Document analysis | PDF, TXT, CSV, DOCX, and more |
| `video_url` | Video analysis | MP4, MOV, WebM |
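All three types follow the same pattern: the message's `content` becomes an array of typed parts instead of a plain string. Here is a minimal sketch of that structure (the URLs are placeholders, and whether a single model accepts every part type at once varies by model, as noted at the end of this guide):

```python
# Each content part is a dict with a "type" key plus a matching payload key.
content = [
    {"type": "text", "text": "Analyze these inputs."},
    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    {"type": "file_url", "file_url": {"url": "https://example.com/report.pdf"}},
    {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
]

# The array replaces the usual string in a message's "content" field.
messages = [{"role": "user", "content": content}]
```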
## Sending Images

Include images in the content array using the `image_url` type:
```json
{
  "model": "model-id",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/photo.jpg"
          }
        }
      ]
    }
  ]
}
```

### Image Formats
| Format | Supported |
|---|---|
| URL (HTTPS) | Yes |
| Base64 data URI | Yes |
| Local file path | No (use base64) |
### Base64 Encoding

Encode local files as base64 data URIs:

```python
import base64

# Read the local image and encode it as base64.
with open("image.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

# `client` is an initialized Chat Completions client.
response = client.chat.completions.create(
    model="model-id",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)
```
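The example above hardcodes the `image/png` MIME type. If you handle mixed file types, a small helper can build the data URI from the file extension (a sketch; `mimetypes` guesses can be wrong for uncommon extensions):

```python
import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Encode a local image as a base64 data URI for the image_url field."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```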
## Multiple Images

Send multiple images in a single request:
```python
response = client.chat.completions.create(
    model="model-id",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}}
            ]
        }
    ]
)
```
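When the set of images is dynamic, building the content array programmatically keeps the request code flat (a sketch reusing the `client` from the earlier examples):

```python
urls = [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg",
]

# Start with the instruction, then append one image part per URL.
content = [{"type": "text", "text": "Compare these images."}]
content.extend(
    {"type": "image_url", "image_url": {"url": url}} for url in urls
)

response = client.chat.completions.create(
    model="model-id",
    messages=[{"role": "user", "content": content}],
)
```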
### Image Detail Level

Control the resolution used for image analysis with the `detail` parameter:
```json
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/photo.jpg",
    "detail": "high"
  }
}
```

| Detail | Description | Token Usage |
|---|---|---|
| `low` | 512×512 fixed | Lower |
| `high` | Full resolution (up to 2048px) | Higher |
| `auto` | Model decides | Varies |
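Where fine detail isn't needed, such as simple classification, requesting `low` detail reduces token usage. A minimal sketch, again reusing the `client` from earlier examples:

```python
response = client.chat.completions.create(
    model="model-id",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is this photo indoors or outdoors?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/photo.jpg",
                    "detail": "low",  # fixed 512x512 analysis, fewer tokens
                },
            },
        ],
    }],
)
```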
## Vision-Capable Models

Not all models support vision. Use the `GET /v1/models` endpoint and check for the `vision` capability to find models that support image inputs.
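For example, you can filter the model list client-side. The sketch below assumes a plain HTTP call and that each model object exposes a `capabilities` list; the base URL, environment variable, and field name are placeholders, so check the Models API reference for the exact response shape:

```python
import os
import requests

# Hypothetical base URL -- substitute your actual API endpoint.
resp = requests.get(
    "https://api.example.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['YUNXIN_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()

vision_models = [
    m["id"]
    for m in resp.json().get("data", [])
    if "vision" in m.get("capabilities", [])  # assumed field name
]
print(vision_models)
```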
## Document Analysis

Send documents for analysis using the `file_url` content type:
```json
{
  "model": "model-id",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Summarize the key findings in this document."},
        {"type": "file_url", "file_url": {"url": "https://example.com/report.pdf"}}
      ]
    }
  ]
}
```

### Multiple Documents

Send multiple documents in a single request:
```python
response = client.chat.completions.create(
    model="model-id",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two documents."},
            {"type": "file_url", "file_url": {"url": "https://example.com/report-q1.pdf"}},
            {"type": "file_url", "file_url": {"url": "https://example.com/report-q2.pdf"}}
        ]
    }]
)
```

## Video Analysis
Send videos for analysis using the `video_url` content type:
```json
{
  "model": "model-id",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what happens in this video."},
        {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}}
      ]
    }
  ]
}
```

Document and video support depends on the model. Use the `vision` capability as a general indicator for multimodal support, but check individual model documentation for specific format support.