# Models

## Overview
Outlines models are objects that wrap an inference client or engine. Models provide a standardized interface to generate structured text.
**Warning**

The model loading functions have been modified in v1. While they used to be called `<name_inference_library>`, they are now called `from_<name_inference_library>`. The model classes' names and `__init__` methods are left unchanged.
All Outlines model classes have an associated loader function to facilitate initializing a model instance. The name of this function is `from_` plus the name of the model in lower-case letters. For instance, Outlines has a `Transformers` model and an associated `from_transformers` loader function. The parameters used to load a model are specific to each provider; please consult the documentation of the model you want to use for more information.
After having created a model instance, you can either call it directly to generate text or first create a reusable generator that you then call. In either case, you can provide an `output_type` to constrain the format of the generation output. See the output types section for more information on constrained generation.
For instance:
```python
from outlines import from_transformers, Generator
import transformers

# Create a model
model = from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    transformers.AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
)

# Call it directly
response = model("How many countries are there in the world", max_new_tokens=20)
print(response)  # 'There are 200 countries in the world.'

# Call it directly with an output_type
response = model("How many countries are there in the world", int, max_new_tokens=20)
print(response)  # '200'

# Create a generator first and then call it
generator = Generator(model, int)
response = generator("How many countries are there in the world")
print(response)  # '200'
```
## Features Matrix
In alphabetical order:
| | Anthropic | Dottxt | Gemini | LlamaCpp | MLXLM | Ollama | OpenAI | SGLang | TGI | Transformers | Transformers MultiModal | VLLM | VLLMOffline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Output Types** | | | | | | | | | | | | | |
| Simple Types | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| JSON Schema | ❌ | ✅ | 🟠 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multiple Choice | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Regex | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Grammar | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | 🟠 | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Generation Features** | | | | | | | | | | | | | |
| Async | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ |
| Streaming | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ |
| Vision | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
| Batching | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ |
## Model Types
Models can be divided into two categories: local models and server-based models.
In the case of local models, the text generation happens within the inference library object used to instantiate the model. This gives Outlines direct access to the generation process (through a logits processor) and means all structured generation output types are available.
The local models available are the following:
- LlamaCpp
- MLXLM
- Transformers
- TransformersMultiModal
- VLLMOffline
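For example, because generation runs in-process, a local model can use any output type, including a JSON schema derived from a Pydantic model. A minimal sketch, reusing the Phi-3 checkpoint from the example above; the `Country` schema is hypothetical:

```python
from pydantic import BaseModel
import transformers
from outlines import from_transformers

# Hypothetical schema used as a JSON Schema output type
class Country(BaseModel):
    name: str
    population: int

# Local model: the logits processor constrains generation in-process
model = from_transformers(
    transformers.AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
    transformers.AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct"),
)

response = model("Describe the most populated country in the world.", Country, max_new_tokens=50)
print(response)  # e.g. '{"name": "India", "population": 1425000000}'
```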
In the case of server-based models, the model is initialized with a client that sends requests to a server in charge of the actual text generation. As a result, Outlines has limited control over text generation and some output types are not supported. The server on which the text generation happens can either be remote (with OpenAI or Anthropic for instance) or local (with SGLang for instance).
The server-based models available are the following:
- Anthropic
- Dottxt
- Gemini
- Ollama
- OpenAI
- SgLang
- TGI
- VLLM
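For instance, a server-based model wraps a client rather than an inference engine. A minimal sketch, assuming the `openai` package is installed, an API key is configured in the environment, and the model name passed to the loader is available on your account:

```python
import openai
from outlines import from_openai

# The client sends requests to a remote server that performs the generation
model = from_openai(openai.OpenAI(), "gpt-4o-mini")

response = model("How many countries are there in the world?")
print(response)
```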
Some models have an async version. To use it, just pass the async version of the provider object to the loading function. It will then return an `Async<ModelName>` instance with the same methods and features as the regular synchronous instance.
For instance:
```python
from outlines import from_tgi
from huggingface_hub import AsyncInferenceClient

# Passing an async client returns an async model instance
model = from_tgi(
    AsyncInferenceClient("http://localhost:8000/v1")
)
print(type(model))  # outlines.models.tgi.AsyncTGI
```
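Once created, the async model exposes the same interface, but calls are awaited. A short sketch, assuming a TGI server is running locally as in the example above:

```python
import asyncio
from huggingface_hub import AsyncInferenceClient
from outlines import from_tgi

async def main():
    # Async model created from an async client
    model = from_tgi(AsyncInferenceClient("http://localhost:8000/v1"))
    # Same call signature as the sync model, but awaited
    response = await model("How many countries are there in the world", int)
    print(response)  # '200'

asyncio.run(main())
```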
The models that have an async version are the following:
- SgLang
- TGI
- VLLM