Extracting financial data from earnings reports
A common task in finance is extracting financial data from earnings reports. Earnings reports are notoriously poorly formatted, as the SEC does not require companies to produce machine-readable documents.
Earnings reports are often provided as HTML documents, which can be difficult to parse. Investors often use complicated parsing systems or manual review to extract data. Entire companies are built around automating this task.
This cookbook is a proof of concept showing how LLMs can extract financial data directly into CSV. Comma-separated values are well structured and can be described by a regular expression, which Outlines can use to guide the LLM's output.
This example is a smaller subset of a full demo found here. The demo contains the full set of pre-processing steps needed to convert raw HTML into a structured CSV file, and tests the results across three companies' 10-K reports.
Setup
Install outlines and required dependencies:
# Later versions of torch can have difficulty with certain CUDA drivers.
# We recommend using 2.4.0 for now, but you may wish to experiment with
# other versions.
pip install outlines pandas transformers torch==2.4.0 accelerate
Load the model
Choose your language model. We'll use Phi-3 mini, which is small enough to run on reasonably small machines.
import outlines
import torch
model_name = 'microsoft/Phi-3-mini-4k-instruct'
model = outlines.models.transformers(
    model_name,
    device='auto',
    model_kwargs={
        # To reduce memory usage, we'll use bfloat16
        "torch_dtype": torch.bfloat16,
    },
)
Set up the data
For brevity, we've attached the markdown version of NVIDIA's 10-K income statement. The full demonstration processes the raw HTML version of the report into these markdown tables. Pages are filtered by whether they appear to contain income statements, and then compacted into the string you see below.
income_statement = """
Table of ContentsNVIDIA Corporation and SubsidiariesConsolidated Statements of Income(In millions, except per share data)
| | | | | | | | | | | | | | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | Year Ended | | | | | | | | | | | | | | |
| | | | Jan 28, 2024 | | | | | | Jan 29, 2023 | | | | | | Jan 30, 2022 | | |
| Revenue | | | $ | 60,922 | | | | | $ | 26,974 | | | | | $ | 26,914 | |
| Cost of revenue | | | 16,621 | | | | | | 11,618 | | | | | | 9,439 | | |
| Gross profit | | | 44,301 | | | | | | 15,356 | | | | | | 17,475 | | |
| Operating expenses | | | | | | | | | | | | | | | | | |
| Research and development | | | 8,675 | | | | | | 7,339 | | | | | | 5,268 | | |
| Sales, general and administrative | | | 2,654 | | | | | | 2,440 | | | | | | 2,166 | | |
| Acquisition termination cost | | | | | | | | | 1,353 | | | | | | | | |
| Total operating expenses | | | 11,329 | | | | | | 11,132 | | | | | | 7,434 | | |
| Operating income | | | 32,972 | | | | | | 4,224 | | | | | | 10,041 | | |
| Interest income | | | 866 | | | | | | 267 | | | | | | 29 | | |
| Interest expense | | | (257) | | | | | | (262) | | | | | | (236) | | |
| Other, net | | | 237 | | | | | | (48) | | | | | | 107 | | |
| Other income (expense), net | | | 846 | | | | | | (43) | | | | | | (100) | | |
| Income before income tax | | | 33,818 | | | | | | 4,181 | | | | | | 9,941 | | |
| Income tax expense (benefit) | | | 4,058 | | | | | | (187) | | | | | | 189 | | |
| Net income | | | $ | 29,760 | | | | | $ | 4,368 | | | | | $ | 9,752 | |
| | | | | | | | | | | | | | | | | | |
| Net income per share: | | | | | | | | | | | | | | | | | |
| Basic | | | $ | 12\.05 | | | | | $ | 1\.76 | | | | | $ | 3\.91 | |
| Diluted | | | $ | 11\.93 | | | | | $ | 1\.74 | | | | | $ | 3\.85 | |
| | | | | | | | | | | | | | | | | | |
| Weighted average shares used in per share computation: | | | | | | | | | | | | | | | | | |
| Basic | | | 2,469 | | | | | | 2,487 | | | | | | 2,496 | | |
| Diluted | | | 2,494 | | | | | | 2,507 | | | | | | 2,535 | | |
"""
The markdown tables extracted from the earnings reports can vary widely in row names, column counts, data types, etc. The advantage of LLMs here is that we can define the data we want in terms of the data types, and the LLM will output the data in the desired format.
For comparison, here is how the income statement looks in the original HTML:
Define the data we want
Outlines is often used for JSON output, but it can also produce CSV. We know which columns we want to extract and the data type of each column: year, for example, is always a four-digit number; revenue is a number that may contain commas; and so on.
We can define a regex pattern for each column type:
# Define the column type regex patterns
column_types = {
# Year is always a four-digit number
"year": r"\d{4}",
# Revenue, operating income, and net income are always numbers with commas.
# This regex permits integers that may begin with a minus sign, and may have
# commas separating the thousands, millions, etc.
"integer_comma": r"((-?\d+),?\d+|(-?\d+))",
# Number is currently not used, but it represents a number with up to two decimal places.
"number": r"(-?\d+(?:\.\d{1,2})?)",
}
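As an illustrative sanity check (not part of the pipeline), you can confirm that each pattern fully matches a representative value from the income statement:
import re

# Each pattern should fully match an example value of its type
assert re.fullmatch(column_types["year"], "2024")
assert re.fullmatch(column_types["integer_comma"], "60,922")
assert re.fullmatch(column_types["number"], "12.05")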
Next, let's choose the columns we want to extract. We want:
- Year, always a four-digit number
- Revenue, a number with commas
- Operating income, a number with commas
- Net income, a number with commas
# Define the columns to extract, and their data types.
columns_to_extract = {
"year": "year",
"revenue": "integer_comma",
"operating_income": "integer_comma",
"net_income": "integer_comma",
}
You can modify column_types to match the data types of the columns you want to extract. Adding a new financial metric to extract is as simple as adding a new key/value pair to columns_to_extract. Note that additional columns are not well tested for accuracy, so use them with caution.
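For example, to also capture basic earnings per share as a decimal value, you could add a hypothetical entry like this:
# Illustrative only: columns beyond the four above are less well tested
columns_to_extract["basic_earnings_per_share"] = "number"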
Create the regex describing the data we want
# Create the header line. This is the requested column names
# separated by commas, i.e. "year,revenue,..."
header = ",".join(columns_to_extract.keys())
# Create the data capture patterns. These are the regex patterns
# that will be used to capture the data in each column
data_patterns = [column_types[dtype] for dtype in columns_to_extract.values()]
data_line = ",".join(data_patterns)
# Our final regex pattern.
max_rows = 3 # We expect 3 rows of data, firms usually report 3 years of income statements
csv_regex = f"{header}(\n{data_line}){{,{max_rows}}}\n\n"
print(csv_regex)
which gives us
year,revenue,operating_income,net_income(
\d{4},((-?\d+),?\d+|(-?\d+)),((-?\d+),?\d+|(-?\d+)),((-?\d+),?\d+|(-?\d+))){,3}
Pretty hairy, right? Thankfully, the code above constructs this regex for us: a header line, followed by a data line that repeats once for each row of data we want to extract. Passing the regex to outlines.generate.regex produces a generator function whose output is always a CSV string consistent with the regex.
Prompting the model
Outlines does not add system or instruction tokens by default, so we need to use transformers.AutoTokenizer to add them for whatever model we're using.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
def add_instruction(prompt):
    return tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
print(add_instruction("Howdy"))
Our prompt roughly describes the task we want the model to perform, and a few pieces of information it may need to know about income statements.
def extract_financial_data_prompt(columns_to_extract, income_statement):
    user_prompt = f"""
Extract annual financial data from this set of pages. Pages
are from a 10k filing and were chosen because they may contain
a comprehensive income statement. Note that selected pages may
be incorrectly extracted, so you should verify that you are extracting
from the comprehensive income statement and not some other financial
statement.
Create a row for each year available in the income statement with the
following columns: {', '.join(columns_to_extract.keys())}. Firms typically report the
most recent 3 years of data, but this can vary.
Each column has types: {', '.join(columns_to_extract.values())}.
# Relevant pages:
{income_statement}
# Key instructions:
1. Look ONLY at the "Consolidated Statements of Income" table
2. For operating income, look for "Income from operations" or "Operating income"
3. For net income, use the TOTAL net income figure, not amounts allocated to specific share classes
4. Use NULL for missing values
5. Operating income must be less than revenue
6. Net income must be less than operating income
7. Ignore segment breakdowns, quarterly data, or per-share amounts
# Output format:
- CSV format with headers: {','.join(columns_to_extract.keys())}
- Use NULL for missing values
- If no data are found, do not create a row.
- Enter two newline characters to terminate the CSV when no more data are found.
# Definitions:
- Revenue: Total sales of goods and services. Usually this is at the top of the
income statement.
- Operating income: Revenue minus operating expenses for the entire company. This is revenue
minus costs. Operating income is also called operating profit, EBIT, or income from
operations.
- Net income: Operating income minus taxes. This is the bottom line of the
income statement.
"""
    return add_instruction(user_prompt)
Running the model
Now that we have our prompt and regular expression, we can run the model.
Construct our regex extractor function. We'll use a greedy sampler, which samples the most likely next token at each step. It's a simple sampler that is more reproducible than multinomial sampling.
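Here is a sketch of that step, assuming the outlines.generate.regex API and the greedy sampler from outlines.samplers:
# Build a generator constrained to our CSV regex, using greedy sampling
csv_extractor = outlines.generate.regex(
    model,
    csv_regex,
    sampler=outlines.samplers.greedy(),
)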
Provide the prompt to the model and run it:
csv_data = csv_extractor(
extract_financial_data_prompt(columns_to_extract, income_statement),
max_tokens=1024,
)
print(csv_data)
year,revenue,operating_income,net_income
2024,60922,32972,29760
2023,26974,4224,4368
2022,26914,10041,9752
Voila! We've extracted the financial data from the income statement, and it's correct upon inspection.
You can even load this into a pandas DataFrame for further analysis:
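A minimal sketch of that step, using pandas.read_csv with an in-memory string buffer:
import pandas as pd
from io import StringIO

# Parse the generated CSV string into a DataFrame
df = pd.read_csv(StringIO(csv_data))
print(df)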