Transformers Vision
Outlines supports vision models through outlines.models.transformers_vision,
which is based on outlines.models.transformers and shares its interface.
Supported tasks include:
- image + text -> text
- video + text -> text
Example: Using Llava-Next Vision Models
Install dependencies
pip install torchvision pillow flash-attn
Create the model
import outlines
from transformers import LlavaNextForConditionalGeneration
model = outlines.models.transformers_vision(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    model_class=LlavaNextForConditionalGeneration,
    device="cuda",
)
Create a convenience function to load a PIL.Image from a URL
from PIL import Image
from io import BytesIO
from urllib.request import urlopen
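def img_from_url(url):
    # Download the raw bytes and wrap them in an in-memory stream for PIL
    img_byte_stream = BytesIO(urlopen(url).read())
    # Convert to RGB so every image has the same mode
    return Image.open(img_byte_stream).convert("RGB")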
Describing an image
description_generator = outlines.generate.text(model)
description_generator(
    "<image> detailed description:",
    [img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg")]
)
This is a color photograph featuring a Siamese cat with striking blue eyes. The cat has a creamy coat and a light eye color, which is typical for the Siamese breed. Its features include elongated ears, a long, thin tail, and a striking coat pattern. The cat is sitting in an indoor setting, possibly on a cat tower or a similar raised platform, which is covered with a beige fabric, providing a comfortable and soft surface for the cat to rest or perch. The surface of the wall behind the cat appears to be a light-colored stucco or plaster.
Multiple Images
To include multiple images in a prompt, add more <image> tokens to the prompt,
one for each image:
image_urls = [
    "https://cdn1.byjus.com/wp-content/uploads/2020/08/ShapeArtboard-1-copy-3.png",  # triangle
    "https://cdn1.byjus.com/wp-content/uploads/2020/08/ShapeArtboard-1-copy-11.png",  # hexagon
]
description_generator = outlines.generate.text(model)
description_generator(
    "<image><image>What shapes are present?",
    list(map(img_from_url, image_urls)),
)
There are two shapes present. One shape is a hexagon and the other shape is an triangle.
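When prompts are assembled programmatically, it can help to confirm that the number of <image> tokens matches the number of images before calling the generator. The following is a minimal sketch under the assumption that the model expects exactly one <image> token per image, as in the example above. The helper name check_image_tokens is hypothetical and is not part of Outlines.

```python
def check_image_tokens(prompt, images, token="<image>"):
    """Return the number of image tokens, or raise if it differs from len(images).

    Hypothetical helper, not part of Outlines. It assumes one `token`
    placeholder per image, as in the multiple-image example above.
    """
    n_tokens = prompt.count(token)
    if n_tokens != len(images):
        raise ValueError(
            f"prompt contains {n_tokens} {token} token(s) "
            f"but {len(images)} image(s) were supplied"
        )
    return n_tokens


# Any sequence works for the check; real code would pass PIL images.
check_image_tokens("<image><image>What shapes are present?", ["img_a", "img_b"])
```

Calling the helper just before description_generator turns a token/image mismatch into an early, readable error instead of a failure inside the model's processor.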
Classifying an Image
pattern = "Mercury|Venus|Earth|Mars|Saturn|Jupiter|Neptune|Uranus|Pluto"
planet_generator = outlines.generate.regex(model, pattern)
planet_generator(
    "What planet is this: <image>",
    [img_from_url("https://upload.wikimedia.org/wikipedia/commons/e/e3/Saturn_from_Cassini_Orbiter_%282004-10-06%29.jpg")]
)
Saturn
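The alternation pattern above can also be built from a list of labels, which is less error-prone when the label set is long or contains regex metacharacters. This is a sketch using only the standard library re module; the helper name build_choice_pattern is hypothetical and not part of Outlines.

```python
import re


def build_choice_pattern(labels):
    """Join labels into a regex alternation, escaping special characters.

    Hypothetical helper, not part of Outlines.
    """
    return "|".join(re.escape(label) for label in labels)


planets = [
    "Mercury", "Venus", "Earth", "Mars", "Saturn",
    "Jupiter", "Neptune", "Uranus", "Pluto",
]
pattern = build_choice_pattern(planets)

# The pattern accepts exactly the listed labels and nothing else.
assert re.fullmatch(pattern, "Saturn") is not None
assert re.fullmatch(pattern, "Sun") is None
```

The resulting string is the same as the hand-written pattern above and can be passed to outlines.generate.regex in the same way.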
Extracting Structured Image Data
from pydantic import BaseModel
from typing import List
class ImageData(BaseModel):
    caption: str
    tags_list: List[str]
    object_list: List[str]
    is_photo: bool
image_data_generator = outlines.generate.json(model, ImageData)
image_data_generator(
    "<image> detailed JSON metadata:",
    [img_from_url("https://upload.wikimedia.org/wikipedia/commons/9/98/Aldrin_Apollo_11_original.jpg")]
)
ImageData(caption='An astronaut on the moon', tags_list=['moon', 'space', 'nasa', 'americanflag'], object_list=['moon', 'moon_surface', 'space_suit', 'americanflag'], is_photo=True)
Resources
Choosing a model
- https://mmbench.opencompass.org.cn/leaderboard
- https://huggingface.co/spaces/WildVision/vision-arena