# Models

## Overview
Outlines models are objects that wrap an inference client or engine. Models provide a standardized interface to generate structured text.
All Outlines model classes have an associated loader function to facilitate initializing a model instance. The name of this function is `from_` followed by the name of the model in lower-case letters. For instance, Outlines has a `Transformers` model with an associated `from_transformers` loader function. The parameters needed to load a model are specific to each provider; please consult the documentation of the model you want to use for more information.
Once you have created a model instance, you can either call it directly to generate text or first create a reusable generator and call that instead.
The input you provide to a model can be a simple text prompt, or a vision or chat input for models that support them. See the model inputs section for more information on model input formats.
In all cases, you can provide an `output_type` to constrain the format of the generated output. See the output types section for more information on constrained generation.
For instance:
```python
from outlines import from_transformers, Generator
import transformers

# Create a model
model = from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    transformers.AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
)

# Call it directly
response = model("How many countries are there in the world?", max_new_tokens=20)
print(response)  # 'There are 200 countries in the world.'

# Call it directly with an output_type
response = model("How many countries are there in the world?", int, max_new_tokens=20)
print(response)  # '200'

# Create a generator first and then call it
generator = Generator(model, int)
response = generator("How many countries are there in the world?")
print(response)  # '200'
```
Some models support streaming through a `stream` method. It takes the same arguments as the `__call__` method, but returns an iterator instead of a string.
For instance:
```python
from outlines import from_openai
import openai

# Create the model
model = from_openai(
    openai.OpenAI(),
    "gpt-4o"
)

# Stream the response
for chunk in model.stream("Tell a short story about a cat.", max_tokens=50):
    print(chunk)  # 'This...'
```
Additionally, some models support batch processing through a `batch` method. It's similar to the `__call__` method, but takes a list of prompts instead of a single prompt and returns a list of strings.
For instance:
```python
from outlines import from_transformers
import transformers

# Create a model
model = from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    transformers.AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
)

# Call the batch method with a list of prompts
response = model.batch(
    ["What's the capital of Latvia?", "What's the capital of Estonia?"],
    max_new_tokens=20,
)
print(response)  # ['Riga', 'Tallinn']
```
## Features Matrix
In alphabetical order:
| | Anthropic | Dottxt | Gemini | LlamaCpp | MLXLM | Ollama | OpenAI | SGLang | TGI | Transformers | TransformersMultiModal | VLLM | VLLMOffline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Output Types** | | | | | | | | | | | | | |
| Simple Types | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| JSON Schema | ❌ | ✅ | 🟠 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multiple Choice | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Regex | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Grammar | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | 🟠 | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Generation Features** | | | | | | | | | | | | | |
| Async | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ |
| Streaming | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ |
| Vision | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| Batching | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ |
## Model Types
Models can be divided into two categories: local models and server-based models.
In the case of local models, the text generation happens within the inference library object used to instantiate the model. This gives Outlines direct access to the generation process (through a logits processor) and means all structured generation output types are available.
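For instance, a local model can be constrained with a JSON schema defined as a Pydantic model, with the structure enforced locally through the logits processor. Below is a minimal sketch; the prompt, schema, and example output are illustrative:

```python
from pydantic import BaseModel
import transformers
from outlines import from_transformers

class Country(BaseModel):
    name: str
    capital: str

# Load a local model; generation happens inside the transformers objects
model = from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    transformers.AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
)

# Use the Pydantic model as the output type to constrain generation
response = model("Give me a country and its capital in JSON.", Country, max_new_tokens=50)
print(response)  # e.g. '{"name": "Latvia", "capital": "Riga"}'
```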
The local models available are the following:
- LlamaCpp
- MLXLM
- Transformers
- TransformersMultiModal
- VLLMOffline
In the case of server-based models, the model is initialized with a client that sends requests to a server in charge of the actual text generation. As a result, Outlines has limited control over generation and some output types are not supported. The server on which the text generation happens can either be remote (with OpenAI or Anthropic, for instance) or local (with SGLang, for instance).
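For instance, with a server-based model such as OpenAI, a JSON schema output type is passed along with the request and enforced by the provider, while output types such as regex are not available (see the Features Matrix above). A minimal sketch, with an illustrative prompt and schema:

```python
from pydantic import BaseModel
import openai
from outlines import from_openai

class Capital(BaseModel):
    country: str
    city: str

# The model wraps a client; the actual generation happens on OpenAI's servers
model = from_openai(openai.OpenAI(), "gpt-4o")

# The JSON schema derived from the Pydantic model is sent with the request
response = model("What is the capital of Latvia? Answer in JSON.", Capital)
print(response)  # e.g. '{"country": "Latvia", "city": "Riga"}'
```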
The server-based models available are the following:
- Anthropic
- Dottxt
- Gemini
- Ollama
- OpenAI
- SGLang
- TGI
- VLLM
Some models have an async version. To use it, simply pass the async version of the inference client to the loader function, which will return an `Async<ModelName>` instance with the same methods and features as the regular synchronous instance.
For instance:
```python
from outlines import from_tgi
from huggingface_hub import AsyncInferenceClient

model = from_tgi(
    AsyncInferenceClient("http://localhost:8000/v1")
)
print(type(model))  # outlines.models.tgi.AsyncTGI
```
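The async instance is called the same way as its sync counterpart, except that the call must be awaited. A minimal sketch continuing the example above, assuming the async model's call returns an awaitable (the prompt and expected output are illustrative):

```python
import asyncio

from huggingface_hub import AsyncInferenceClient
from outlines import from_tgi

model = from_tgi(AsyncInferenceClient("http://localhost:8000/v1"))

async def main():
    # Same interface as the sync model, but the call must be awaited
    response = await model("What's the capital of Latvia?")
    print(response)  # e.g. 'Riga'

asyncio.run(main())
```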
The models that have an async version are the following:
- Ollama
- SGLang
- TGI
- VLLM