# Vision & Documents

Send images, documents, and videos alongside text for multimodal understanding and analysis.
## Overview

Multimodal-capable models can analyze images, documents, and videos alongside text. Yunxin supports multimodal inputs through the Chat Completions API using the `content` array.
## Content Types

| Type | Description | Supported Formats |
|---|---|---|
| `image_url` | Image analysis | JPEG, PNG, GIF, WebP (URL or base64) |
| `file_url` | Document analysis | PDF, TXT, CSV, DOCX, and more |
| `video_url` | Video analysis | MP4, MOV, WebM |
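All three types follow the same pattern: the message's `content` becomes an array of typed parts instead of a plain string. Here is a minimal sketch of that structure (the URLs are placeholders, and whether a single model accepts every part type at once varies by model, as noted at the end of this guide):

```python
# Each content part is a dict with a "type" key plus a matching payload key.
content = [
    {"type": "text", "text": "Analyze these inputs."},
    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    {"type": "file_url", "file_url": {"url": "https://example.com/report.pdf"}},
    {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
]

# The array replaces the usual string in a message's "content" field.
messages = [{"role": "user", "content": content}]
```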
## Sending Images

Include images in the content array using the `image_url` type:
```json
{
  "model": "model-id",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/photo.jpg"
          }
        }
      ]
    }
  ]
}
```

### Image Formats
| Format | Supported |
|---|---|
| URL (HTTPS) | Yes |
| Base64 data URI | Yes |
| Local file path | No (use base64) |
### Base64 Encoding

Encode local files as base64 data URIs:

```python
import base64

# Read the local image and encode it as base64.
with open("image.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

# `client` is an initialized Chat Completions client.
response = client.chat.completions.create(
    model="model-id",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)
```
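The example above hardcodes the `image/png` MIME type. If you handle mixed file types, a small helper can build the data URI from the file extension (a sketch; `mimetypes` guesses can be wrong for uncommon extensions):

```python
import base64
import mimetypes

def to_data_uri(path: str) -> str:
    """Encode a local image as a base64 data URI for the image_url field."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Cannot determine MIME type for {path}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```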
## Multiple Images

Send multiple images in a single request:
```python
response = client.chat.completions.create(
    model="model-id",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}}
            ]
        }
    ]
)
```
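When the set of images is dynamic, building the content array programmatically keeps the request code flat (a sketch reusing the `client` from the earlier examples):

```python
urls = [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg",
]

# Start with the instruction, then append one image part per URL.
content = [{"type": "text", "text": "Compare these images."}]
content.extend(
    {"type": "image_url", "image_url": {"url": url}} for url in urls
)

response = client.chat.completions.create(
    model="model-id",
    messages=[{"role": "user", "content": content}],
)
```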
### Image Detail Level

Control the resolution used for image analysis with the `detail` parameter:
```json
{
  "type": "image_url",
  "image_url": {
    "url": "https://example.com/photo.jpg",
    "detail": "high"
  }
}
```

| Detail | Description | Token Usage |
|---|---|---|
| `low` | 512×512 fixed | Lower |
| `high` | Full resolution (up to 2048px) | Higher |
| `auto` | Model decides | Varies |
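Where fine detail isn't needed, such as simple classification, requesting `low` detail reduces token usage. A minimal sketch, again reusing the `client` from earlier examples:

```python
response = client.chat.completions.create(
    model="model-id",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is this photo indoors or outdoors?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/photo.jpg",
                    "detail": "low",  # fixed 512x512 analysis, fewer tokens
                },
            },
        ],
    }],
)
```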
## Vision-Capable Models

Not all models support vision. Use the `GET /v1/models` endpoint and check for the `vision` capability to find models that support image inputs.
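For example, you can filter the model list client-side. The sketch below assumes a plain HTTP call and that each model object exposes a `capabilities` list; the base URL, environment variable, and field name are placeholders, so check the Models API reference for the exact response shape:

```python
import os
import requests

# Hypothetical base URL -- substitute your actual API endpoint.
resp = requests.get(
    "https://api.example.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['YUNXIN_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()

vision_models = [
    m["id"]
    for m in resp.json().get("data", [])
    if "vision" in m.get("capabilities", [])  # assumed field name
]
print(vision_models)
```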
## Document Analysis

Send documents for analysis using the `file_url` content type:
```json
{
  "model": "model-id",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Summarize the key findings in this document."},
        {"type": "file_url", "file_url": {"url": "https://example.com/report.pdf"}}
      ]
    }
  ]
}
```

### Multiple Documents

Send multiple documents in a single request:
```python
response = client.chat.completions.create(
    model="model-id",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two documents."},
            {"type": "file_url", "file_url": {"url": "https://example.com/report-q1.pdf"}},
            {"type": "file_url", "file_url": {"url": "https://example.com/report-q2.pdf"}}
        ]
    }]
)
```

## Video Analysis
Send videos for analysis using the `video_url` content type:
```json
{
  "model": "model-id",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what happens in this video."},
        {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}}
      ]
    }
  ]
}
```

Document and video support depends on the model. Use the `vision` capability as a general indicator for multimodal support, but check individual model documentation for specific format support.