# Models
Outlines supports generation with a number of inference engines (`outlines.models`). Loading a model follows a similar interface regardless of the inference engine:
```python
import os

import outlines

model = outlines.models.transformers("microsoft/Phi-3-mini-128k-instruct")
model = outlines.models.transformers_vision("llava-hf/llava-v1.6-mistral-7b-hf")
model = outlines.models.vllm("microsoft/Phi-3-mini-128k-instruct")
model = outlines.models.llamacpp(
    "microsoft/Phi-3-mini-4k-instruct-gguf", "Phi-3-mini-4k-instruct-q4.gguf"
)
model = outlines.models.exllamav2("bartowski/Phi-3-mini-128k-instruct-exl2")
model = outlines.models.mlxlm("mlx-community/Phi-3-mini-4k-instruct-4bit")
model = outlines.models.openai(
    "gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"]
)
```
## Feature Matrix
| | Transformers | Transformers Vision | vLLM | llama.cpp | ExLlamaV2 | MLXLM | OpenAI* |
|---|---|---|---|---|---|---|---|
| **Device** | | | | | | | |
| Cuda | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | N/A |
| Apple Silicon | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | N/A |
| x86 / AMD64 | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | N/A |
| **Sampling** | | | | | | | |
| Greedy | ✅ | ✅ | ✅ | ✅* | ✅ | ✅ | ❌ |
| Multinomial | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Multiple Samples | ✅ | ✅ | ❌ | ❌ | ✅ | | |
| Beam Search | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ |
| **Generation** | | | | | | | |
| Batch | ✅ | ✅ | ✅ | ❌ | ? | ❌ | ❌ |
| Stream | ✅ | ❌ | ❌ | ✅ | ? | ✅ | ❌ |
| **`outlines.generate`** | | | | | | | |
| Text | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Structured | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| JSON Schema | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Choice | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Regex | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Grammar | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
## Caveats
- OpenAI doesn't support structured generation due to limitations in their API and server implementation.
- `outlines.generate` "Structured" includes methods such as `outlines.generate.regex`, `outlines.generate.json`, `outlines.generate.cfg`, etc.
- MLXLM only supports Apple Silicon.
- llama.cpp greedy sampling is available via multinomial sampling with `temperature = 0.0`.
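
The temperature trick works because `softmax(logits / T)` collapses onto the argmax token as `T` approaches 0, so a multinomial sampler at `temperature = 0.0` always picks the highest-logit token. A minimal, self-contained sketch of that equivalence in plain Python (illustrative only — not the Outlines or llama.cpp implementation):

```python
import math
import random

def sample_multinomial(logits, temperature, rng=None):
    """Sample a token index from softmax(logits / temperature)."""
    rng = rng or random.Random(0)
    # temperature == 0.0 is the greedy limit: take the argmax directly.
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Inverse-CDF sampling over the categorical distribution.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(logits) - 1

logits = [1.0, 3.5, 0.2]
greedy_pick = sample_multinomial(logits, temperature=0.0)  # always index 1, the argmax
```

At very low but nonzero temperatures the scaled distribution already puts essentially all probability mass on the argmax, which is why `temperature = 0.0` is treated as the greedy special case rather than dividing by zero.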