
Transformers Vision

Outlines allows seamless use of vision models.

outlines.models.transformers_vision shares its interface with, and is based on, outlines.models.transformers.

Tasks supported include

  • image + text -> text
  • video + text -> text

Example: Using Llava-Next Vision Models

Install dependencies:

pip install torchvision pillow flash-attn
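Before loading a model, it can help to confirm that these packages are visible to the interpreter. The following is a minimal sketch and not part of Outlines: `missing_packages` is a hypothetical helper, and the module names (`torchvision`, `PIL`, `flash_attn`) are assumed to be the import names of the three packages listed above.

```python
from importlib.util import find_spec

# Maps pip package name -> top-level module name it is assumed to install.
REQUIRED = {
    "torchvision": "torchvision",
    "pillow": "PIL",
    "flash-attn": "flash_attn",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose top-level module cannot be located."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All vision dependencies found.")
```

The check only looks the modules up, so it stays fast even when the packages are large.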

Create the model

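import outlines
from transformers import LlavaNextForConditionalGeneration

# The checkpoint name and device are assumptions; substitute any
# LLaVA-NeXT checkpoint and the device available on your machine.
model = outlines.models.transformers_vision(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    model_class=LlavaNextForConditionalGeneration,
    device="cuda",
)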

Create convenience function to load a PIL.Image from URL

from PIL import Image
from io import BytesIO
from urllib.request import urlopen

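def img_from_url(url):
    # Download the raw bytes and wrap them in a file-like object for PIL.
    img_byte_stream = BytesIO(urlopen(url).read())
    return Image.open(img_byte_stream).convert("RGB")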

Describing an image

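description_generator = outlines.generate.text(model)
# image_url is a placeholder for the URL of the image to describe.
description_generator("<image> detailed description:", [img_from_url(image_url)])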

This is a color photograph featuring a Siamese cat with striking blue eyes. The cat has a creamy coat and a light eye color, which is typical for the Siamese breed. Its features include elongated ears, a long, thin tail, and a striking coat pattern. The cat is sitting in an indoor setting, possibly on a cat tower or a similar raised platform, which is covered with a beige fabric, providing a comfortable and soft surface for the cat to rest or perch. The surface of the wall behind the cat appears to be a light-colored stucco or plaster.

Multiple Images

To include multiple images in your prompt, add more <image> tokens to the prompt:

image_urls = [
    "",  # triangle
    "",  # hexagon
]
description_generator = outlines.generate.text(model)
description_generator(
    "<image><image>What shapes are present?",  # one <image> token per image
    list(map(img_from_url, image_urls)),
)

There are two shapes present. One shape is a hexagon and the other shape is an triangle.
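When the number of images varies, the prompt can be built programmatically. The sketch below assumes the prompt should carry exactly one <image> token per image; `build_prompt` and `check_prompt` are hypothetical helpers, not Outlines functions, and the file names are placeholders standing in for PIL images.

```python
IMAGE_TOKEN = "<image>"

def build_prompt(question, images):
    """Prefix the question with one <image> token per image."""
    if not images:
        raise ValueError("at least one image is required")
    return IMAGE_TOKEN * len(images) + question

def check_prompt(prompt, images):
    """Raise if the number of <image> tokens differs from the number of images."""
    n_tokens = prompt.count(IMAGE_TOKEN)
    if n_tokens != len(images):
        raise ValueError(
            f"prompt has {n_tokens} image tokens but {len(images)} images were given"
        )

# Placeholder strings stand in for PIL images; only the count matters here.
images = ["triangle.png", "hexagon.png"]
prompt = build_prompt("What shapes are present?", images)
check_prompt(prompt, images)
print(prompt)
```

Running a check like this before calling the generator turns a token and image mismatch into an early, readable error.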

Classifying an Image

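pattern = "Mercury|Venus|Earth|Mars|Saturn|Jupiter|Neptune|Uranus|Pluto"
planet_generator = outlines.generate.regex(model, pattern)

# image_url is a placeholder for the URL of the planet image.
planet_generator("What planet is this: <image>", [img_from_url(image_url)])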
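Because the generator is constrained by the regular expression, the answer should be exactly one of the listed alternatives. The sketch below uses only the standard library to build the same pattern from a list of labels and to check a candidate answer against it; `is_valid_label` is a hypothetical helper for illustration.

```python
import re

planets = ["Mercury", "Venus", "Earth", "Mars", "Saturn",
           "Jupiter", "Neptune", "Uranus", "Pluto"]

# re.escape keeps labels literal even if they contain regex metacharacters.
pattern = "|".join(re.escape(name) for name in planets)

def is_valid_label(text):
    """True only if the whole string is exactly one of the allowed labels."""
    return re.fullmatch(pattern, text) is not None

print(pattern)
print(is_valid_label("Saturn"), is_valid_label("Saturn."))
```

Building the pattern from a list keeps the label set in one place, which is convenient when the same labels are reused for evaluation.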


Extracting Structured Image data

from pydantic import BaseModel
from typing import List

class ImageData(BaseModel):
    caption: str
    tags_list: List[str]
    object_list: List[str]
    is_photo: bool

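image_data_generator = outlines.generate.json(model, ImageData)

# image_url is a placeholder for the URL of the image to annotate.
image_data_generator("<image> detailed JSON metadata:", [img_from_url(image_url)])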

ImageData(caption='An astronaut on the moon', tags_list=['moon', 'space', 'nasa', 'americanflag'], object_list=['moon', 'moon_surface', 'space_suit', 'americanflag'], is_photo=True)
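To make the expected JSON shape concrete without loading a model, here is a standard-library sketch that parses a JSON string with the same four fields. `ImageRecord` and `parse_image_data` are hypothetical stand-ins rather than Outlines or Pydantic APIs, and the sample string mirrors the output shown above.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class ImageRecord:
    caption: str
    tags_list: List[str]
    object_list: List[str]
    is_photo: bool

def parse_image_data(raw: str) -> ImageRecord:
    """Parse a JSON object and check it has exactly the expected fields and types."""
    data = json.loads(raw)
    expected = {"caption": str, "tags_list": list, "object_list": list, "is_photo": bool}
    if set(data) != set(expected):
        raise ValueError(f"unexpected fields: {sorted(set(data) ^ set(expected))}")
    for name, kind in expected.items():
        if not isinstance(data[name], kind):
            raise TypeError(f"{name} should be {kind.__name__}")
    return ImageRecord(**data)

raw = json.dumps({
    "caption": "An astronaut on the moon",
    "tags_list": ["moon", "space", "nasa", "americanflag"],
    "object_list": ["moon", "moon_surface", "space_suit", "americanflag"],
    "is_photo": True,
})
record = parse_image_data(raw)
print(record.caption, record.is_photo)
```

With the Pydantic model in this section, the generator performs the equivalent validation and returns an `ImageData` instance whose fields are read the same way.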


Choosing a model
