Transformers

Installation

You need to install the transformers library to use Transformers models in Outlines. Install all optional dependencies of the Transformers model with: pip install outlines[transformers].

See the Hugging Face documentation for more information on installing transformers with CPU or GPU support.
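
As a quick sanity check, you can verify that both libraries import correctly (this snippet only inspects the installed version; it is not part of the Outlines API):

import outlines
import transformers

# Print the installed transformers version to confirm the install
print(transformers.__version__)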

Model Initialization

To load the model, use the from_transformers function. It takes two arguments:

  • model: a transformers model (created with AutoModelForCausalLM for instance)
  • tokenizer_or_processor: a transformers tokenizer (created with AutoTokenizer for instance); it must be an instance of either PreTrainedTokenizer or PreTrainedTokenizerFast

For instance:

import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

# Create the transformers model and tokenizer
hf_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
hf_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create the Outlines model
model = outlines.from_transformers(hf_model, hf_tokenizer)

If you provide a processor instead of a tokenizer as the second argument of from_transformers, you will get a TransformersMultiModal instance, as sketched below. See the TransformersMultiModal model documentation for more information on using multimodal models in Outlines.
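
For illustration, a rough sketch of loading a vision-language model (the LLaVA model id and classes below are illustrative choices, not a recommendation):

import outlines
from transformers import LlavaForConditionalGeneration, AutoProcessor

# Create the transformers model and processor (illustrative model id)
hf_model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
hf_processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Providing a processor instead of a tokenizer yields a TransformersMultiModal instance
model = outlines.from_transformers(hf_model, hf_processor)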

Text Generation

To generate text, you can simply call the model with a prompt.

For instance:

import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

# Create model
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
)

# Call it to generate text
result = model("What's the capital of Latvia?", max_new_tokens=20)
print(result) # 'Riga'

The Transformers model also supports batch generation. To use it, provide a list of prompts instead of a single prompt. You will receive a list of completions in return. For instance:

import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

# Create model
model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
)

# Create a list of prompts that will be used in a single batch
prompts = [
    "What's the capital of Lithuania?",
    "What's the capital of Latvia?",
    "What's the capital of Estonia?"
]

# Call it to generate text
result = model(prompts, max_new_tokens=20)
print(result) # ['Vilnius', 'Riga', 'Tallinn']

Structured Generation

As a local model, Transformers supports all output types available in Outlines. Simply provide an output_type after the prompt when calling the model.

Simple Type

import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

output_type = int

model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
)

result = model("How many countries are there in the world?", output_type, max_new_tokens=5)
print(result) # '200'

JSON Schema

import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer
from pydantic import BaseModel
from typing import List

class Character(BaseModel):
    name: str
    age: int
    skills: List[str]

model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
)

result = model("Create a character.", output_type=Character, max_new_tokens=200, repetition_penalty=0.5)
print(result) # '{"name": "Evelyn", "age": 34, "skills": ["archery", "stealth", "alchemy"]}'
print(Character.model_validate_json(result)) # name='Evelyn' age=34 skills=['archery', 'stealth', 'alchemy']

Multiple Choice

from typing import Literal
import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

output_type = Literal["Paris", "London", "Rome", "Berlin"]

model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
)

result = model("What is the capital of France?", output_type, max_new_tokens=10, temperature=0)
print(result) # 'Paris'

Regex

import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer
from outlines.types import Regex

output_type = Regex(r"\d{3}-\d{2}-\d{4}")

model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
)

result = model("Generate a fake social security number.", output_type, max_new_tokens=20, top_p=0.5)
print(result) # '782-32-3789'

Context-Free Grammar

import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer
from outlines.types import CFG

arithmetic_grammar = """
?start: sum

?sum: product
| sum "+" product   -> add
| sum "-" product   -> sub

?product: atom
| product "*" atom  -> mul
| product "/" atom  -> div

?atom: NUMBER           -> number
| "-" atom         -> neg
| "(" sum ")"

%import common.NUMBER
%import common.WS_INLINE

%ignore WS_INLINE
"""
output_type = CFG(arithmetic_grammar)

model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
)

result = model("Write an addition.", output_type, max_new_tokens=100)
print(result) # '23 + 48'

Inference Arguments

When calling the model, you can provide optional inference parameters in addition to the prompt and the output type. These parameters are passed on to the generate method of the transformers model. Common inference arguments include max_new_tokens, temperature, repetition_penalty and top_p.

See the transformers documentation for more information on inference parameters.
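
For instance, a sketch combining several of these arguments in a single call (the sampling values are arbitrary):

import outlines
from transformers import AutoModelForCausalLM, AutoTokenizer

model = outlines.from_transformers(
    AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
)

# All keyword arguments below are forwarded to the generate method
result = model(
    "Describe the capital of Latvia in one sentence.",
    max_new_tokens=50,       # upper bound on generated tokens
    do_sample=True,          # enable sampling so temperature/top_p apply
    temperature=0.7,         # softmax temperature
    top_p=0.9,               # nucleus sampling threshold
    repetition_penalty=1.1,  # values > 1 discourage repetition
)
print(result)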

Warning

The max_new_tokens inference parameter defaults to 20. This is insufficient for most tasks and causes the response to be truncated, so the output may not respect the output type. We recommend always providing a value for this argument.