Named entity extraction
Named Entity Extraction is a fundamental problem in NLP. It involves identifying and categorizing named entities within a document: people, organizations, dates, places, etc. It is usually the first step in a more complex NLP workflow. Here we will use the example of a pizza restaurant that receives orders via its website and needs to identify the number and types of pizzas that are being ordered.
Getting LLMs to output the extracted entities in a structured format can be challenging. In this tutorial we will see how we can use Outlines' JSON-structured generation to extract entities from a document and return them in a valid JSON data structure 100% of the time.
As always, we start by initializing the model. We will be using a quantized version of Mistral-7B-v0.1 (we're GPU poor):
import outlines
model = outlines.models.transformers("TheBloke/Mistral-7B-OpenOrca-AWQ", device="cuda")
And we will be using the following prompt template:
@outlines.prompt
def take_order(order):
    """You are the owner of a pizza parlor. Customers \
    send you orders from which you need to extract:
    1. The pizza that is ordered
    2. The number of pizzas
    # EXAMPLE
    ORDER: I would like one Margherita pizza
    RESULT: {"pizza": "Margherita", "number": 1}
    # OUTPUT INSTRUCTIONS
    Answer in valid JSON. Here are the different objects relevant for the output:
    Order:
        pizza (str): name of the pizza
        number (int): number of pizzas
    Return a valid JSON of type "Order"
    # OUTPUT
    ORDER: {{ order }}
    RESULT: """
We now define our data model using Pydantic:
from enum import Enum
from pydantic import BaseModel
class Pizza(str, Enum):
    margherita = "Margherita"
    pepperonni = "Pepperoni"
    calzone = "Calzone"

class Order(BaseModel):
    pizza: Pizza
    number: int
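To see what this data model accepts, here is a minimal sketch that checks two hand-written JSON strings against it without involving the LLM. The helper `parse_order` and the "Hawaiian" order are hypothetical and only serve the illustration; the classes are repeated so that the snippet runs on its own.

```python
import json
from enum import Enum

from pydantic import BaseModel, ValidationError


class Pizza(str, Enum):
    margherita = "Margherita"
    pepperonni = "Pepperoni"
    calzone = "Calzone"


class Order(BaseModel):
    pizza: Pizza
    number: int


def parse_order(raw: str) -> Order:
    """Hypothetical helper: parse a JSON string and validate it as an Order."""
    return Order(**json.loads(raw))


# The example from the prompt template fits the model.
valid = parse_order('{"pizza": "Margherita", "number": 1}')

# A pizza that is not in the Pizza enum is rejected by validation.
try:
    parse_order('{"pizza": "Hawaiian", "number": 1}')
    rejected = False
except ValidationError:
    rejected = True
```

Because `pizza` is an enum, only the three listed names pass validation, and `number` must be an integer.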
We can now define our generator and call it on several incoming orders:
orders = [
    "Hi! I would like to order two pepperonni pizzas and would like them in 30mins.",
    "Is it possible to get 12 margheritas?"
]
prompts = [take_order(order) for order in orders]
generator = outlines.generate.json(model, Order)
results = generator(prompts)
print(results)
# [Order(pizza=<Pizza.pepperonni: 'Pepperoni'>, number=2),
# Order(pizza=<Pizza.margherita: 'Margherita'>, number=12)]
There are several ways you could improve this example:
- Clients may order several types of pizzas.
- Clients may order drinks as well.
- If the pizza place has a delivery service, we need to extract the client's address and phone number.
- Clients may specify the time for which they want the pizza. We could then check against a queuing system and reply to them with the estimated delivery time.
How would you change the Pydantic model to account for these use cases?
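One possible answer, as a rough sketch rather than a definitive design: nest a list of line items inside the order and make the delivery details optional. Every name below other than `Pizza` (`Drink`, `PizzaItem`, `DrinkItem`, `Delivery`, `ExtendedOrder`, and the drinks menu itself) is an assumption made for illustration. Whether each construct is supported by the structured generator should be checked against the Outlines version in use.

```python
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel


class Pizza(str, Enum):
    margherita = "Margherita"
    pepperonni = "Pepperoni"
    calzone = "Calzone"


class Drink(str, Enum):
    # Hypothetical drinks menu, not part of the original example.
    cola = "Cola"
    water = "Water"


class PizzaItem(BaseModel):
    pizza: Pizza
    number: int


class DrinkItem(BaseModel):
    drink: Drink
    number: int


class Delivery(BaseModel):
    address: str
    phone: str


class ExtendedOrder(BaseModel):
    pizzas: List[PizzaItem]               # several types of pizzas per order
    drinks: List[DrinkItem] = []          # drinks are optional
    delivery: Optional[Delivery] = None   # only present for delivery orders
    requested_time: Optional[str] = None  # kept as free text, e.g. "19:30"


# Build an order by hand to check that the nested structure validates.
example = ExtendedOrder(
    pizzas=[
        {"pizza": "Pepperoni", "number": 2},
        {"pizza": "Calzone", "number": 1},
    ],
    drinks=[{"drink": "Cola", "number": 3}],
    requested_time="19:30",
)
total_pizzas = sum(item.number for item in example.pizzas)
```

Keeping `requested_time` as a string is the simplest choice here; a stricter time type would make the comparison against a queuing system easier, at the cost of a more constrained output format.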