5757 S. University Ave.
Chicago, IL 60637
Main: 773.702.5599
bfi.uchicago.edu
WORKING PAPER · NO. 2024-65
Financial Statement Analysis with Large
Language Models
Alex G. Kim, Maximilian Muhn, and Valeri V. Nikolaev
MAY 2024
Financial Statement Analysis with
Large Language Models
Alex G. Kim[1], Maximilian Muhn[2], and Valeri V. Nikolaev[3]
This draft: May 20, 2024
Abstract
We investigate whether an LLM can successfully perform financial statement analy-
sis in a way similar to a professional human analyst. We provide standardized and
anonymous financial statements to GPT-4 and instruct the model to analyze them to
determine the direction of future earnings. Even without any narrative or industry-
specific information, the LLM outperforms financial analysts in its ability to predict
earnings changes. The LLM exhibits a relative advantage over human analysts in sit-
uations when the analysts tend to struggle. Furthermore, we find that the prediction
accuracy of the LLM is on par with the performance of a narrowly trained state-of-
the-art ML model. The LLM's predictions do not stem from its training memory. Instead,
we find that the LLM generates useful narrative insights about a company’s future
performance. Lastly, our trading strategies based on GPT's predictions yield higher Sharpe ratios and alphas than strategies based on other models. Taken together, our
results suggest that LLMs may take a central role in decision-making.
Keywords: Financial statement analysis, Large language models, GPT-4, chain-of-
thought, neural network, asset pricing, earnings, direction of earnings changes, analysts
JEL Codes: G12, G14, G41, M41
Companion App: To showcase the capabilities of LLMs for financial statement analysis, we created an interactive Companion App.[*]
[1] The University of Chicago, Booth School of Business, alex.kim@chicagobooth.edu
[2] The University of Chicago, Booth School of Business, maximilian.muhn@chicagobooth.edu
[3] The University of Chicago, Booth School of Business, valeri.nikolaev@chicagobooth.edu
We appreciate insightful comments from Bok Baik, Mark Bradshaw, Joachim Gassen, Ralph Koijen,
Laurence van Lent, Christian Leuz, and Sanjog Misra, and workshop participants at the Bernstein Quanti-
tative Finance Conference, University of Chicago, University of North Carolina at Chapel Hill, 2024 Tuck
Accounting Spring Camp at Dartmouth, and Korean-American Accounting Professors’ Association. Yijing
Zhang provided excellent research assistance. The authors gratefully acknowledge financial support from the
University of Chicago Research Support Center, Fama-Miller Center for Finance Research, and the Stevens
Doctoral Program at the University of Chicago Booth School of Business.
[*] This app requires a ChatGPT Plus subscription and relies on a different prompt that integrates narrative context while processing 10-Ks and 10-Qs step-by-step. The downside of this functionality is that it is more prone to retrieval errors, and the accuracy of the retrieved information must be verified.
1 Introduction
Can large language models (LLMs) make informed financial decisions or are they simply
a support tool? Their advanced capabilities to analyze, interpret, and generate text enable
LLMs to excel across a wide range of tasks, including summarization of complex disclosures,
sentiment analysis, information extraction, report generation, compliance verification, etc.
(e.g., Bernard et al., 2023; Bybee, 2023; Choi and Kim, 2023; Kim et al., 2023a,b; Lopez-Lira
and Tang, 2023). All these tasks, however, involve the textual domain and require specialized
training or fine-tuning of the model.[1] The boundaries of this disruptive technology outside
of the textual domain and with respect to more general tasks that require numeric analysis
and judgment are yet to be understood. We probe these boundaries in the financial analysis
domain.
We study whether an LLM can successfully perform financial statement analysis in a
way similar to what professional human analysts do. The answer to this question has far-
reaching implications for the future of financial analysis and whether financial analysts will
continue to be the backbone of informed decision-making in financial markets. The answer
is far from obvious, given that an LLM lacks the deep understanding of the financials of a
company that a human expert would have. Further, one of the most challenging domains for a
language model is the numerical domain, where the model needs to carry out computations,
perform human-like interpretations, and make complex judgments (Brown et al., 2020).
While LLMs are effective at textual tasks, their understanding of numbers typically comes
from the narrative context and they lack deep numerical reasoning or the flexibility of a
human mind.
Financial statement analysis (FSA), sometimes referred to as fundamental analysis, is
a particularly useful setting to examine the role of LLMs in future decision-making. Tra-
ditionally, financial statement analysis is performed by financial analysts and investment
professionals with the primary objective to understand the financial health of a company
and determine whether its performance is sustainable. Unlike a typical task performed by
an LLM, FSA is a quantitative task that involves analyzing trends and ratios. At the same
time, it also requires critical thinking, reasoning, and ultimately, complex judgments. Im-
portantly, unlike in other applications, such as answering bar or CPA exam questions (Choi
et al., 2022; Eulerich et al., 2023), an LLM cannot rely on its memory for the correct answer.
Our research design involves passing a balance sheet and income statement in a stan-
dardized form to the large language model, GPT-4 Turbo, and asking the model to analyze
[1] For example, to be able to efficiently summarize texts, an LLM is trained on a large corpus of documents that involve summaries typically generated by humans.
them. In particular, based on the analysis of the two financial statements, the model must
decide whether a firm’s economic performance is sustainable and, more specifically, whether
a company’s earnings will grow or decline in the following period. We focus on earnings
because they are the primary variable forecasted by financial analysts and fundamental for
valuation (Penman and Sougiannis, 1998; Penman, 2001; Monahan et al., 2018).
A key research design choice that we make is to not provide any textual information
(e.g., Management Discussion and Analysis) that typically accompanies financial statements.
While textual information is easy to integrate, our primary interest lies in understanding the
LLMs’ ability to analyze and synthesize purely financial numbers. We use this setup to
examine several research questions.
First, can a large language model generate economic insights purely from the numbers
reported in financial statements absent any narrative context? How does an LLM’s per-
formance compare to that of human analysts and do they add incremental value? Can
the model’s performance be enhanced via instructions that emulate steps typically followed
by financial analysts? How does the LLM's performance compare to other benchmarks, such as logistic regression and a state-of-the-art ANN design, and can the model offer additional insights?
Conceptually, an LLM can add value relative to a human analyst due to its ability
to quickly analyze large quantities of unstructured data and a vast knowledge base that
enables the model to recognize patterns (e.g., familiar business situations) in the data. It is
not obvious, however, that these considerations are particularly relevant for the present task.
In fact, there are a number of reasons to expect that professional analysts will outperform a
machine-based approach to financial statement analysis. First, financial statement analysis
is a complex and loosely defined task that involves ambiguity and requires common sense,
intuition, and flexibility of the human mind. Second, it requires reasoning and judgment that
machines presently lack. Finally, it necessitates a broader understanding of the industry and
macro-economy.
When compared to a narrowly specialized ML application, such as an artificial neural
net (ANN) trained for earnings prediction, an LLM also appears to be at a serious disadvan-
tage. Training a specialized ANN allows the model to learn deep interactions that contain
important cues that cannot be easily gathered by the general-purpose model without pro-
viding additional insights or context. Nevertheless, an LLM’s advantage potentially lies in
its vast knowledge and general understanding of the world, such as business concepts and
investment theories that allow it to emulate deductive reasoning performed by humans. This
could include intuitive reasoning and forming hypotheses based on incomplete information
or previously unseen scenarios.
Our approach to testing an LLM’s performance involves two steps. First, we anonymize
and standardize corporate financial statements to prevent the potential memory of the com-
pany by the language model. In particular, we omit company names from the balance sheet
and income statement and replace years with labels such as t and t−1. Further, we
standardize the format of the balance sheet and income statement in a way that follows
Compustat’s balancing model. This approach ensures that the format of financial state-
ments is identical across all firm-years so that the model does not know what company or
even time period its analysis corresponds to.
In the second stage, we design prompts that instruct the model to perform financial
statement analysis and, subsequently, to determine the direction of future earnings.[2] In
addition to a simple prompt, we develop a Chain-of-Thought (CoT) prompt that effectively
“teaches” the model to mimic a financial analyst.[3] In particular, as a part of their analysis,
financial analysts identify notable trends in financial statement line items, compute key
financial ratios (e.g., operating efficiency, liquidity, and/or leverage ratios), synthesize this
information, and form expectations about future earnings (Bouwman et al., 1987). Our
CoT prompt implements this thought process via a set of instructions ultimately making
a determination of whether next year’s earnings will increase or decrease compared to the
current year.
We test the model’s performance using the Compustat universe and, when necessary,
intersect it with the IBES universe. The full sample spans the 1968-2021 period and includes
150,678 firm-year observations from 15,401 distinct firms. The analyst sample spans the
1983-2021 period with 39,533 observations from 3,152 distinct firms. Our target variable
across all models is a directional change in future earnings. To evaluate analysts’ prediction
accuracy, we compute consensus forecasts (the median of individual analyst forecasts issued
in the month following the release of financial statements) and use them as an expectation for
the following year’s earnings. This ensures the comparability of analysts’ forecasts and model
prediction results.[4] In addition, we also use three-month and six-month ahead consensus
forecasts as alternative expectation benchmarks. These benchmarks disadvantage the LLM
as they incorporate the information acquired during the year. However, because analysts
[2] Focusing on predicting the direction of future earnings provides a specific and measurable objective,
facilitating the benchmarking of the model’s performance. It is also consistent with early and more recent
literature on this topic (e.g., Ou and Penman, 1989; Chen et al., 2022). Additionally, the focus on a binary
variable is also motivated by the notion that most key decisions performed by humans are binary in nature
(e.g., Kahneman, 2011).
[3] Chain-of-thought prompts are known to enhance the model's problem-solving capability and induce
human-like reasoning (Wei et al., 2022).
[4] Since the quantitative models use only financial statement variables, this aligns the timing of human forecasts with the timing of AI-based forecasts.
may be sluggish to incorporate new information into their forecasts, we report them for
comparison purposes.
We start by analyzing GPT’s performance compared to security analysts in predicting the
direction of future earnings (Ou and Penman, 1989). At the outset, we note that predicting
changes in EPS is a highly complex task as the EPS time series are approximated by a random
walk and contain a large unpredictable component. We find that the first-month analysts’
forecasts achieve an accuracy of 53% in predicting the direction of future earnings, which
dominates the 49% accuracy of a naive model that extrapolates the prior year's change.[5]
Three- and six-month ahead forecasts achieve a meaningfully higher accuracy of 56% and
57% respectively, which is intuitive given that they incorporate more timely information.
GPT forecasts based on a “simple” (non-CoT) prompt achieve an accuracy of 52%, which is lower than the analyst benchmarks and in line with our prior. However, when
we use the chain of thought prompt to emulate human reasoning, we find that GPT achieves
an accuracy of 60%, which is remarkably higher than that achieved by the analysts. Similar
conclusions follow if we examine the F1-score, which is an alternative metric to evaluate a
model’s forecasting ability (based on a combination of its precision and recall). This implies
that GPT comfortably dominates the performance of a median financial analyst in analyzing
financial statements to determine the direction a company is moving in.
We probe deeper to understand the strengths and weaknesses of humans relative to an
LLM. Intuitively, human analysts may rely on soft information or a broader context not
available to the model and thus add value (Costello et al., 2020; Liu, 2022). Indeed, we
find that analysts’ forecasts contain useful insights about future performance not captured
by GPT. Furthermore, we show that when humans struggle to come up with accurate forecasts, GPT's insights are more valuable. Similarly, in the instances where human forecasts
are prone to biases or inefficiency (i.e., not incorporating information rationally), GPT’s
forecasts are more useful in predicting the direction of future earnings.
As human forecasts are known to exhibit statistical biases (Abarbanell and Bernard, 1992;
Basu and Markov, 2004), it is also interesting to examine GPT’s performance relative to
specialized ML applications trained specifically to predict earnings based on a large dataset.
We examine three such forecasting models. The first model follows Ou and Penman (1989)
and relies on a stepwise logistic regression model with 59 predictors.[6] Our second model
is an artificial neural network (ANN) that uses the same 59 predictors but also leverages
[5] This finding is consistent with Bradshaw et al. (2012), who show that analysts are superior in predicting
one-year ahead earnings.
[6] We exclude predictors that rely on stock prices, in particular the P/E ratio, because balance sheets
and income statements do not contain stock price information. This exclusion ensures comparability of our
benchmark. The results are qualitatively similar, however, if we include this variable.
non-linearities and interactions among them. Third, to ensure consistency between GPT and
ANN, we also use the ANN model trained on the same information set (the income statement
and balance sheet) that we provide to GPT. Importantly, we train these models each year
based on five years of historical data using a population of observations on Compustat. All
forecasts are out of sample.[7]
Using the entire Compustat sample, we find that the stepwise logistic regression achieves
an accuracy (F1-score) of 52.94% (57.23%), which is on par with human analysts and consis-
tent with the prior literature (Ou and Penman, 1989; Hunt et al., 2022). In contrast, ANN
trained on the same data achieves a much higher accuracy of 60.45% (F1-score of 61.62%), which
is in the range of the state-of-the-art earnings prediction models. When we use GPT CoT
forecasts, we observe that the model achieves an accuracy of 60.31% on the entire sample,
which is very similar to the ANN's accuracy. In fact, GPT exhibits a meaningfully higher F1-score than the ANN (63.45% vs. 61.62%). When we train the ANN exclusively using
the data from the two financial statements (fed into GPT), which is a smaller information
set, we find that ANN’s predictive ability is slightly lower, with an accuracy (F1-score) of
59.02% (60.66%), compared to GPT’s performance. Overall, these findings suggest that
GPT's accuracy is on par with (or even slightly higher than) the accuracy of narrowly specialized state-of-the-art machine learning applications. This is a somewhat surprising result because specialized models are trained to leverage information most efficiently. It indicates a remarkable aptitude of a pre-trained large language model for analyzing financial statements, all the more so given that we do not provide any textual disclosures, such as the MD&A.
We further observe that ANN’s and GPT’s predictions are complementary in that both
of them contain useful incremental information, with some indication that GPT tends to
do well when ANN struggles. In particular, ANN predicts earnings based on the training
examples it saw in the past data, and given that many of the examples are complex and
highly multidimensional, its learning capacity may be limited. In contrast, GPT makes
relatively fewer mistakes when predicting the earnings of small or loss-making companies,
likely benefiting from its human-like reasoning and extensive knowledge. This ability to draw
upon a broader range of knowledge provides a distinct advantage for the language model.
We perform several additional experiments partitioning the samples based on GPT’s
confidence in its answers, and using different families of LLMs. When GPT answers with
higher confidence, the forecasts tend to be more accurate than less confident forecasts. We
also find that the earlier version, GPT3.5, shows considerably less impressive performance,
suggesting that our main results should not be taken for granted. At the same time, we show
[7] We use five years to allow the model's parameters to change over time, which helps to ensure accuracy. We also experimented with longer windows and found similar results.
that the results generalize to other LLMs. In particular, Gemini Pro, recently released by
Google, achieves a similar level of accuracy compared to GPT 4.
Given the LLM's consistently impressive performance in fundamental analysis documented above, it is interesting to understand why the model is so successful. We examine two broad
hypotheses. The first hypothesis is that GPT’s performance is driven by its (possibly near-
perfect) memory. It would be especially problematic if GPT could somehow infer the com-
pany’s identity and year from the data and match this information with the sentiment about
this company learned from newspaper articles or press releases. We aim to rule out this
hypothesis (see Section 6.1). Furthermore, we replicate our results using the most recent
year of data, which lies outside GPT4’s training period (i.e., pure out-of-sample tests).
Our second hypothesis is that GPT generates useful insights based on which the model
infers the direction of future earnings. For example, we observe that the model frequently
computes the standard ratios used by financial analysts and, as instructed by the CoT prompt,
generates narratives that analyze these ratios. To test this, we pool all narratives generated
by the model for a given firm-year and encode them into 768-dimensional vectors (embed-
dings) using BERT. We then feed these vectors into an ANN and train it to predict the
direction of future earnings. We find that the ANN trained on the GPT’s narrative insights
achieves an accuracy of 59%, which is almost as high as the GPT forecast accuracy (60%).
In fact, the embedding-based ANN achieves an F1-score that is higher than GPT’s (65%
vs. 63%). This result presents direct evidence that the narrative insights generated by the
model are informative about future performance. Further, we observe a 94% correlation
between GPT's forecasts and ANN forecasts based on GPT's narratives, suggesting that
the information encoded by these narratives is the basis for GPT’s forecasts. We also find
that narratives related to ratio analysis, in particular, are most important in explaining
the direction of future earnings. In sum, the narratives derived from CoT reasoning are
responsible for the model’s superior performance.
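For concreteness, the encoding stage of this test can be sketched as follows; the specific checkpoint (bert-base-uncased) and mean pooling are illustrative assumptions, since the text specifies only 768-dimensional BERT embeddings.

```python
# A minimal sketch of the narrative-encoding step, assuming BERT-base
# (768-dimensional hidden states) and mean pooling; the paper specifies only
# "768-dimensional vectors (embeddings) using BERT", so details are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed_narrative(text: str) -> torch.Tensor:
    """Encode one firm-year narrative into a pooled 768-dimensional vector."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = bert(**inputs)
    # Mean-pool the token embeddings into a single [768] vector.
    return output.last_hidden_state.mean(dim=1).squeeze(0)

vec = embed_narrative("Operating margins improved while leverage declined...")
print(vec.shape)  # torch.Size([768]); the input to the downstream ANN
```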
Finally, we explore the economic usefulness of GPT’s forecasts by analyzing their value
in predicting stock price movements. We find that the long-short strategy based on GPT
forecasts outperforms the market and generates significant alphas and Sharpe ratios. For
example, alpha in the Fama-French three-factor model exceeds 12% per year. GPT stands
out for doing particularly well in predicting the returns for small companies, as compared to
ANN-based strategies.[8]
We make several contributions to the literature. First, to the best of our knowledge, we
[8] This finding aligns with our earlier result that GPT is relatively better in predicting earnings for smaller
companies compared to ANNs. Given that GPT’s training dataset likely contained a disproportionate
amount of information from larger firms, this result further challenges the notion that GPT’s performance
is merely a function of its memory.
are the first to provide large-scale evidence on an LLM's ability to analyze financial statements, a complex task that is traditionally performed by human analysts. We show that an LLM
can generate state-of-the-art inferences about the direction of the company, outperforming
financial analysts and prior models and generating valuable insights along the way. Im-
portantly, we show that the language model can successfully analyze numbers in financial
statements without any narrative context.
Second, our results provide evidence on the limits of LLMs. In particular, the ability of generative AI to successfully perform tasks outside of its native domain is not well understood. We find that an LLM excels in a quantitative task that requires intuition
and human-like reasoning. The ability to perform tasks across domains points towards the
emergence of Artificial General Intelligence. Broadly, our analysis suggests that LLMs can
take a more central place in decision-making than previously thought.
Third, we contribute to the literature on fundamental analysis. Starting from Ou and
Penman (1989), there is a large literature in accounting that focuses on earnings prediction
based on accounting fundamentals (for example, Bochkay and Levine, 2019; Hunt et al.,
2022; Chen et al., 2022). In particular, Chen et al. (2022) predict the direction of earnings
changes using tree-based machine learning models trained on over 12,000 explanatory variables based on firms' XBRL tags.[9]
We use a novel approach to analyze financial information
to derive insights about future performance. In particular, we show that an LLM-based
financial statement analysis, by drawing on vast knowledge and chain-of-thought reasoning,
complements humans as well as specialized models in generating value-relevant information.
In that sense, we also contribute to the recent literature on the relative advantage of humans
versus AI in financial markets (Costello et al., 2020; Liu, 2022; Cao et al., 2024).
2 Conceptual Underpinnings
Financial statement analysis, or fundamental analysis, has long been considered of crit-
ical importance for informed decision-making (e.g., Graham and Dodd, 1934). It uses the
numbers reported in financial statements to gain insights into the financial health of the
company, aiming to reveal information about a firm’s future prospects and valuation (Ou
and Penman, 1989; Piotroski, 2000; Sloan, 2019).
Financial statement analysis underlies the work performed by financial analysts, who
play a pivotal role in financial markets.[10] One of their primary tasks involves predicting
[9] The observed variation in prediction accuracy relative to Chen et al. (2022) can be attributed to the
considerably fewer predictive variables included in our sample. Additionally, when our analysis is confined
to firms examined in Chen et al. (2022), the prediction accuracy of GPT notably increases to 64%.
[10] Analysts are often formally trained in financial statement analysis. For example, financial statement analysis is a major part of the Level I CFA exam.
firms’ earnings, which serves both as an input in their own stock market recommendations
and an output that informs investors (Stickel, 1991; Brown et al., 2015). When making
earnings forecasts, their work typically begins with a systematic analysis of financial state-
ments (Bouwman et al., 1987), often using standardized templates to ensure consistency and
accuracy. This analysis enables financial analysts to establish a baseline understanding of
a company’s financial position and performance, assessing factors such as operating perfor-
mance or capital structure. They then contextualize this financial data by drawing upon
their industry and private knowledge about the firm before issuing their forecasts (Brown
et al., 2015). The accuracy and quality of these forecasts not only drive market percep-
tions but also are fundamental to analysts’ career advancement and job security (Basu and
Markov, 2004; Groysberg et al., 2011).
Prior research generally concludes that sell-side analysts outperform time series models in
terms of producing credible annual earnings forecasts (e.g., Bradshaw, 2011). Consequently,
these forecasts are frequently used as a proxy for markets’ earnings expectations. At the same
time, prior research has shown that financial analysts produce potentially erroneous or biased
estimates (Bradshaw, 2011; Kothari et al., 2016). For example, Green et al. (2016) show that
analysts make technical errors and questionable economic judgments when evaluating firms
with quantitative methods. Evidence from De Bondt and Thaler (1990) and Bordalo et al. (2019) suggests that financial analysts overreact to recent events. These mistakes and biases
highlight the complexity of processing information efficiently when large volumes of data are
involved.
Recognizing these challenges in conventional financial forecasting and human information
processing, general-purpose language models, such as ChatGPT, hold promise in facilitat-
ing financial statement analysis and the associated tasks such as earnings forecasting and
decision-making more generally. These advanced AI systems are noted for their expansive
knowledge across various domains and ability to quickly and efficiently process large quanti-
ties of data (Achiam et al., 2023). For example, their proficiency extends to answering CFA
or CPA exam questions (Eulerich et al., 2023), demonstrating their financial knowledge and
potential for understanding theories. In a similar vein, prior literature has shown that these
models are capable of efficiently processing large sets of financial data (e.g., Kim et al.,
2023b,a). LLMs have also shown promise in predicting certain economic outcomes. Lopez-
Lira and Tang (2023) and Jiang et al. (2022) show that GPT can explain short-term stock
returns based on newspaper headlines and Bybee (2023) finds that GPT’s macroeconomic
prediction aligns well with the expert survey results. In addition, Hansen and Kazinnik
(2023) document that GPT can understand the political stance of FOMC announcements
and relate it to future macroeconomic shocks.
However, despite the successes of large language models in many tasks, they are pri-
marily viewed as a support tool and their ability to act autonomously to perform financial
statement analysis at the level of a human analyst faces significant challenges. First, financial statement analysis is a broad task that is more of an art than a science, whereas machines typically excel in narrow, well-defined tasks. It requires common sense, intuition, the ability to reason and make judgments, and the ability to handle previously unseen situations. Second, an LLM is not trained to analyze financial information in the same way it is trained to summarize text or answer questions. In fact, inputs into the tasks performed by LLMs have been predominantly qualitative and language-based, and LLMs have struggled with the numeric domain (Brown et al., 2020). Third, humans are more capable of incorporating their knowledge of the broader context (something a machine often cannot do) by taking into account soft information and knowledge of the industry, as well as regulatory, political, and macroeconomic factors. These factors stack the odds against an LLM achieving human-like performance in analyzing financial statements.[11]
An alternative to utilizing a general-purpose large language model for financial statement
analysis involves specifying a narrower objective, such as earnings prediction, and training a specialized ML model, such as an artificial neural network (ANN), to perform this task.
Unlike the general-purpose large language models, which are trained to predict the next word
in a textual sequence, ANNs learn deep interactions among a large number of predictors to
deliver powerful forecasts of the target variable.[12]
Because LLMs are not trained to uncover
these complex relationships among predictors, they are fundamentally disadvantaged relative
to the specialized models in a specific prediction task. Nevertheless, the effectiveness of these
ANNs can be limited if they encounter patterns not observed during training with sufficient
frequency. This is where theoretical knowledge or general understanding of how the world
works becomes essential, as does the value of human experience, intuition, and judgment.
This possibly grants an important advantage to an LLM due to its training on a vast body of
general knowledge that encompasses a multitude of business cases and situations, financial
theories, and economic contexts. This broader theoretical foundation potentially allows
LLMs to infer insights even from unfamiliar data patterns, providing an advantage in the
complex domain of financial analysis.
[11] These more complex quantitative tasks have been traditionally seen as outside of the LLM's “technological frontier” (e.g., Dell'Acqua et al., 2023). Consistent with this argument, Li et al. (2023) process earnings press releases and find that GPT performs worse in predicting earnings relative to sell-side analysts.
[12] Over time, methods for predicting earnings have progressively advanced within the accounting litera-
ture. Ou and Penman (1989) predict earnings changes using a stepwise logistic regression model that uses
approximately 60 accounting variables as input. Most recently, Chen et al. (2022) use 13,881 in-line XBRL
tags and tree-based machine learning models to predict future earnings.
3 Methodology and Data
In this section, we outline how we approach the primary task of using an LLM to analyze
and predict earnings changes. Earnings prediction is a complex task that combines qualita-
tive and quantitative analyses and involves professional judgment. We model how analysts
make earnings predictions with a chain-of-thought prompt using GPT-4.
3.1 Financial Statement Analysis and Earnings Prediction
Overview Earnings prediction derived from financial statement analysis is of considerable
importance to accounting information users. For example, such predictions help investors
to make inferences about the cross-section of expected stock returns (Fama and French,
2015) or to pick the best-performing stocks (Piotroski, 2000). However, earnings are hard
to predict as they are influenced by many exogenous factors such as macroeconomic shocks
(Ball et al., 2022), product market demand shocks, changes in accounting standards (Ball
et al., 2000), and many other factors. Therefore, predicting earnings is challenging even for
state-of-the-art ML models (see Bochkay and Levine, 2019; Chen et al., 2022, for example).
Financial analysts approach this complex task by performing financial statement analysis.
They first analyze financial statements, identifying notable changes or trends in accounting
information. They choose which financial ratios to compute to obtain further insights. Their
analysis is enriched by contextual information, such as industry information, understanding
of the competitive landscape, and macroeconomic conditions (Bouwman et al., 1987). Based
on this information, they apply professional judgments to determine whether a company’s
earnings will grow or contract in the future.
In this study, we specifically focus on a relatively narrow information set that includes
numerical information reported on the face of two primary financial statements. While
this lacks textual information or broader context and thus puts an LLM at a disadvantage
relative to a human, it presents a well-defined information set of exclusively numeric data.
This approach allows us to test the limits of the model when analyzing financials and deriving
insights from the numeric data, something that an LLM is neither designed nor trained to do.
To approach FSA-based earnings prediction with a large language model, we implement two types of prompts. First, we use a “simple” prompt that instructs an LLM to
analyze the two financial statements of a company and determine the direction of future
earnings. This prompt does not provide further guidance on how to approach the prediction
task, however.[13] Second, we implement a Chain-of-Thought prompt that breaks down the
[13] In particular, we simply present a standardized and anonymous balance sheet and income statement
and ask the model to predict whether earnings will increase or decrease in the subsequent period.
problem into steps that parallel those followed by human analysts. This prompt effectively
ingrains the methodology into the model, guiding it to mimic human-like reasoning in its
analysis. We mostly focus on the results from this second prompt in our analysis.
Human Processing and Chain-of-Thought Modern large language models can retrieve
numbers from structured tables and perform simple calculations. However, they lack the ability to reason like a human and exercise judgment. Recent research suggests that chain-
of-thought prompting can significantly enhance the reasoning and problem-solving abilities
of large language models (Wei et al., 2022).
We implement the CoT prompt as follows. We instruct the model to take on the role
of a financial analyst whose task is to perform financial statement analysis. The model
is then instructed to (i) identify notable changes in certain financial statement items, and
(ii) compute key financial ratios without explicitly limiting the set of ratios that need to be
computed. When calculating the ratios, we prompt the model to state the formulae first, and
then perform simple computations. The model is also instructed to (iii) provide economic
interpretations of the computed ratios. Then, using the basic quantitative information and
the insights that follow from it, the model is instructed to predict whether earnings are likely
to increase or decrease in the subsequent period. Along with the direction, we instruct the
model to produce a paragraph that elaborates its rationale. Overall, this set of instructions
aims to replicate how human analysts analyze financial statements to determine whether a
firm’s performance is sustainable (Bouwman et al., 1987).
In addition to the binary prediction accompanied by a rationale statement, we also
prompt the model to provide the predicted magnitude of earnings change and the confidence
in its answer (Bybee, 2023; Choi and Kim, 2023). The magnitudes contain three categories:
large, moderate, and small. The confidence score measures how certain the model is in
producing its answers and ranges from zero (random guess) to one (perfectly informed).
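For illustration, a prompt implementing these instructions might read as follows; this is a paraphrase of the steps described above, not the verbatim prompt used in the paper.

```python
# An illustrative paraphrase of the CoT prompt, not the verbatim prompt.
COT_PROMPT = """Take the role of a financial analyst performing financial
statement analysis on the anonymized balance sheet and income statement below.
1. Identify notable changes in financial statement line items.
2. Compute key financial ratios, stating each formula before computing it.
3. Provide economic interpretations of the computed ratios.
Based on this analysis, predict whether earnings will increase or decrease in
year t+1, classify the expected magnitude (large / moderate / small), report
your confidence on a scale from 0 (random guess) to 1 (perfectly informed),
and write a paragraph explaining your rationale.
"""
```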
We use gpt-4-0125-preview, the most up-to-date GPT model offered by OpenAI at the time of our experiment. The temperature parameter is set to zero to ensure minimal variability in the model's responses. We do not cap the maximum number of output tokens, and the top-p sampling parameter is set to one (i.e., no truncation of the sampling distribution; combined with a temperature of zero, the most likely token is effectively always selected). In addition, we enable the logprobs option to obtain token-level log probability values. Figure 1 provides a visual illustration of GPT's processing steps.
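A minimal sketch of this configuration using the OpenAI Python SDK is shown below; it reuses the COT_PROMPT string sketched above, and the user message is a placeholder rather than an actual anonymized statement.

```python
# A sketch of the API configuration described above (OpenAI Python SDK v1.x);
# the message contents are placeholders, not the exact inputs used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

statements = "BALANCE SHEET (t-1, t): ... INCOME STATEMENT (t-2, t-1, t): ..."

response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    temperature=0,   # minimal variability in responses
    top_p=1,         # no nucleus-sampling truncation
    logprobs=True,   # return token-level log probabilities
    messages=[
        {"role": "system", "content": COT_PROMPT},  # sketched above
        {"role": "user", "content": statements},
    ],
)
print(response.choices[0].message.content)             # prediction and rationale
token_logprobs = response.choices[0].logprobs.content  # per-token log probs
```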
3.2 Data
We use the entire universe of Compustat annual financial data from the 1968 to 2021
fiscal years. We also set aside data for 2022 to predict 2023 fiscal year earnings to test for
the robustness of the model’s performance outside GPT’s training window. In particular,
the GPT-4-Turbo preview’s training window ends in April 2023, and the model cannot have
seen the earnings data of 2023, which was released in late March 2024. Following prior
literature, we require that each observation has non-missing total assets, year-end assets
value exceeding one million dollars, a year-end stock price exceeding one dollar per share,
and a fiscal period end date of December 31.[14]
We also drop observations where the balance
sheet equation does not hold. These filters leave us with 150,678 observations from 15,401
distinct firms, reasonably approximating the Compustat universe.
For each firm-year, we reconstruct the balance sheet and income statement using the
data from Compustat. The format follows Capital IQ’s balancing model and is the same
across all firm years. We omit any identifying information, such as the firm name or dates of
the financial statements. This step ensures that all firm-year observations have an identical
financial statement structure. Consistent with US GAAP reporting requirements, we provide
two years of balance sheet and three years of income statement data. An example of the two
statements is provided in Appendix B.[15]
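As a toy illustration of the year-relabeling step (the helper name and input format are ours, not the paper's):

```python
# A toy sketch of the anonymization step: actual fiscal years are replaced
# with relative labels so the model cannot infer the time period.
def relabel_years(years):
    """Map fiscal years, e.g. [2018, 2019, 2020], to labels 't-2', 't-1', 't'."""
    latest = max(years)
    return {y: "t" if y == latest else f"t-{latest - y}" for y in years}

print(relabel_years([2019, 2020]))        # {2019: 't-1', 2020: 't'}
print(relabel_years([2018, 2019, 2020]))  # {2018: 't-2', 2019: 't-1', 2020: 't'}
```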
For the analysis that involves analyst forecasts, we use data from IBES, starting the
sample in 1983. We extract individual forecasts and construct monthly consensus forecasts.
This analysis restricts the sample to firm-years with analyst following. We require that each
observation has at least three analyst forecasts issued, which leaves us with 39,533 firm-year
observations.
We report descriptive statistics for the variables used in our analyses in Table 1. Panel A
describes the full sample (1968-2021), and Panel B is restricted to the analyst sample (1983-
2021). The data in Panel A reveals that approximately 55.5% of observations report an actual
increase in earnings (Target). Predicted values include the prefix Pred and vary depending
on the model. For example, the GPT prediction (Pred_GPT) implies that, on average, 53.0% of observations will experience an increase in earnings. In Panel B, Pred_Analyst_1m denotes
the forecasts issued within one month from the previous year’s earnings release. Analyst
forecasts indexed by 3m and 6m suffixes are defined in an analogous manner. Compared to
GPT, financial analysts tend to be slightly more pessimistic in their forecasts (fluctuating
around 52% depending on the timing of the forecasts). Panel B also reveals that companies
in the Analyst Sample are, on average, larger in size (Size), have a lower book-to-market
ratio (BtoM), higher leverage (Leverage), and lower earnings volatility (Earn Vol). However,
they are similar in terms of the actual frequency of EPS increases.
[14] Focusing on December 31 firms allows for more straightforward asset pricing tests in Section 7 and is
consistent with Ou and Penman (1989); Hunt et al. (2022).
[15] Importantly, we do not train or fine-tune the LLM on the financial statements. The model
observes only a single balance sheet and income statement at a time, as provided in Appendix B.
4 How Does an LLM Perform Compared to Financial Analysts?
In this section, we evaluate the performance of a large language model in the analysis
of financial statements aimed at predicting the direction of future earnings by using human
analysts as a benchmark. All prediction models have a binary target variable, which indicates
an increase or decrease in EPS in the subsequent year.
4.1 Prediction Methods and Evaluation Metrics
Naive model First, as a naive benchmark, we assume that the directional change in
earnings will stay the same. In particular, if EPS has increased (decreased) in year t relative
to year t−1, the naive prediction for year t+1 is also “increase” (“decrease”).
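In code, the naive benchmark can be sketched as follows; the dataframe layout and column names are a hypothetical convention.

```python
# A sketch of the naive benchmark: the sign of the year-t EPS change is
# repeated as the prediction for year t+1. Column names are hypothetical.
import pandas as pd

df = pd.DataFrame({"firm_id": [1, 1, 1], "year": [2019, 2020, 2021],
                   "eps": [1.00, 1.20, 1.10]})
df = df.sort_values(["firm_id", "year"])
df["eps_change_t"] = df.groupby("firm_id")["eps"].diff()
df["pred_naive_increase"] = df["eps_change_t"] > 0  # NaN (first year) -> False
```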
Analysts’ forecasts We use consensus analyst forecasts of year t+1 EPS published following the announcement of year t earnings. If there are multiple forecasts issued by a single analyst, we use the one closest to the year t earnings release date. This approach
helps us to ensure that human analysts are making predictions of one-year-ahead earnings
based on financial statements published in the current year. Then we take the median value
of analysts’ forecasts and compare it to the actual year t EPS. We require at least three
analyst forecasts in a given firm-year to compute median values. If the median forecasted
EPS value is larger than the year t EPS, we label the prediction as “increase” and vice versa.
Analyst forecast accuracy is then obtained in an analogous manner.
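This construction can be sketched as follows; the table layout and column names (fcst_date, release_date, eps_t) are illustrative assumptions.

```python
# A sketch of the consensus labeling: keep each analyst's forecast closest to
# the year-t earnings release, require at least three analysts per firm-year,
# take the median, and compare it to actual year-t EPS.
import pandas as pd

def consensus_label(forecasts: pd.DataFrame) -> pd.DataFrame:
    f = forecasts.copy()
    f["gap"] = (f["fcst_date"] - f["release_date"]).abs()
    # Closest forecast per analyst per firm-year.
    f = (f.sort_values("gap")
          .groupby(["firm", "year", "analyst"], as_index=False).first())
    g = f.groupby(["firm", "year"]).agg(
        n=("fcst_eps", "size"),
        median_fcst=("fcst_eps", "median"),
        eps_t=("eps_t", "first"),
    ).reset_index()
    g = g[g["n"] >= 3]  # require at least three analysts
    g["pred_analyst_increase"] = g["median_fcst"] > g["eps_t"]
    return g
```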
As a comparison, we also collect analyst forecasts issued at least three and six months
after the release of year t financial statements. This ensures that the analysts have enough
time to process the reported financials. However, this also means that the analysts will
have access to one or two quarterly financial statements and other contextual information
generated during the year t + 1. Therefore, human analysts generally have an informational
advantage relative to the models that rely on time t information only.
Evaluation Metrics We report two common metrics to evaluate the quality of the pre-
diction method: accuracy and F1-score. Accuracy is the percentage of correctly predicted
cases scaled by the total number of predictions made. F1-score is the harmonic mean of
precision and recall. Precision measures the proportion of true positive predictions in the
total positive predictions, while recall measures the proportion of true positive predictions
out of all actual positives. In particular, F1-score is defined as follows:
\[ F_1 = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{1} \]
where $TP$ is the number of true positive predictions, $FP$ is the number of false positive predictions, and $FN$ is the number of false negative predictions.
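Both metrics are standard; for example, using scikit-learn with illustrative arrays:

```python
# A minimal sketch of the two evaluation metrics (arrays are illustrative).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # actual direction: 1 = EPS increase
y_pred = [1, 0, 0, 1, 1, 1]  # predicted direction

print(accuracy_score(y_true, y_pred))  # 0.667: share of correct predictions
print(f1_score(y_true, y_pred))        # 0.75: 2*TP / (2*TP + FP + FN), Eq. (1)
```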
4.2 Main Results
Table 2 compares GPT’s prediction accuracy with that achieved by financial analysts.
Based on the first-month forecast following the release of prior year financial statements,
analysts' accuracy is 52.71% and their F1-score is 54.48% when predicting the direction of one-year-ahead earnings. As expected, this is better than predictions based on a naive model (accuracy = 49.11% and F1-score = 53.02%). However, these results also reiterate the
notion that changes in earnings are very hard to predict, even for sophisticated financial
analysts. As expected, the analysts’ prediction accuracy improves through the course of
the year t + 1, achieving an accuracy of 55.95% and 56.58% for month-three and month-six
forecasts, respectively.
Turning to GPT’s predictions, we observe the following: Using a simple prompt instruct-
ing GPT to analyze financial statements and predict the direction of future earnings yields an
accuracy of 52.33% and an F1-score of 54.52%. Thus, without CoT reasoning, the model’s
performance is on par with the first-month consensus forecasts by financial analysts, fol-
lowing the earnings release. However, the performance markedly improves when we utilize
CoT-based GPT forecasts. With chain-of-thought prompts, GPT achieves an accuracy of
60.35%, a roughly 7-percentage-point increase over analyst predictions issued one month after the earnings release. The difference is statistically significant at the 1% level.[16] This edge is
particularly noteworthy since we do not provide the language model with any of the narrative or contextual information available to analysts beyond the balance sheet and income statement.
Taken together, our results suggest that GPT can outperform human analysts by per-
forming financial statement analysis even without any specific narrative context. Our results
also highlight the importance of a human-like step-by-step analysis that allows the model to
follow the steps typically performed by human analysts. In contrast, simply instructing the
model to analyze complex financial statements does not yield strong prediction results.
4.3 Complementarity Between Human Analysts and GPT
Given that GPT outperforms human analysts in predicting future earnings, this finding
raises the question of whether an LLM can largely replace human analysts. In our context,
humans are expected to rely on a broader information set and hence should have an advantage
over an LLM that does not have access to qualitative information, for example. More
[16] GPT outperforms human analysts in terms of accuracy at the 5% statistical significance level.
generally, humans often rely on soft information not easily accessible to a machine (Costello
et al., 2020; Liu, 2022), which puts humans at an informational advantage. We next explore
the presence of complementarities and trade-offs related to LLM vs. human forecasts.
Sources of Incorrect Answers We start with the analysis of instances where forecasts are
erroneous. We estimate a simple linear regression to examine whether firm characteristics
have systematic associations with prediction accuracy. I(incorrect = 1) is an indicator
variable that equals one when the earnings prediction does not match the actual change in
earnings. We then estimate the following OLS regression:
\[ I(\text{incorrect} = 1)_{it} = \beta X_{it} + \delta_{year} + \delta_{ind} + \varepsilon_{it} \tag{2} \]

$X_{it}$ is a vector of firm $i$'s year $t$ characteristics: asset size, leverage, book-to-market ratio, earnings volatility, loss indicator, and property, plant, and equipment scaled by total assets. $\delta_{year}$ and $\delta_{ind}$ denote year and industry (SIC two-digit) fixed effects, respectively. All continuous variables are winsorized at the 1% level and standard errors are clustered at the SIC two-digit industry level.
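A sketch of estimating Equation (2) along these lines is shown below; the synthetic dataframe merely stands in for the Compustat firm-year panel.

```python
# A sketch of Equation (2): a linear probability model with year and SIC2
# fixed effects and standard errors clustered at the SIC2 level. The data
# below are synthetic stand-ins for the Compustat panel.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "incorrect": rng.integers(0, 2, n),
    "size": rng.normal(6, 2, n),
    "leverage": rng.uniform(0, 1, n),
    "btom": rng.normal(0.6, 0.3, n),
    "earn_vol": rng.uniform(0, 0.5, n),
    "loss": rng.integers(0, 2, n),
    "ppe": rng.uniform(0, 1, n),
    "year": rng.integers(2000, 2005, n),
    "sic2": rng.integers(10, 20, n),
})

model = smf.ols(
    "incorrect ~ size + leverage + btom + earn_vol + loss + ppe"
    " + C(year) + C(sic2)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["sic2"]})
print(model.summary())
```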
We present the results in Table 3, Panel A, and Figure 2. In column (1), we document
that GPT’s predictions are more likely to be inaccurate when the firm is smaller in size,
has a higher leverage ratio, records a loss, and exhibits volatile earnings. These results are
intuitive and, notably, prior studies find these characteristics to be economically associated
with earnings quality.[17] For comparison, in columns (2), (3), and (4), we report the determi-
nants of analysts’ inaccurate predictions. Several interesting differences emerge compared to
column (1). First, even though analysts face difficulties in predicting small firms’ earnings,
the magnitude of these coefficients is nearly half of that in column (1) (p-values are less than 1% for all three comparisons). Considering that analysts have access to
narrative information and broader context, this result is consistent with Kim and Nikolaev
(2023b), who show that context matters more for prediction tasks when the firm is smaller
in size. Another notable difference is that analysts are less likely to make errors relative to
[17] Due to high fixed costs of maintaining adequate internal controls, small firms may have lower-quality
accounting earnings (Ball and Foster, 1982; Ge and McVay, 2005) and are more likely to restate their
earnings in subsequent periods (Ashbaugh-Skaife et al., 2007). High leverage ratios are often indicative of
firms being closer to debt covenant violations. Such firms might be more incentivized to engage in earnings
management to meet or beat financial thresholds, leading to lower-quality earnings (Watts and Zimmerman,
1986). Also, when firms experience unusual financial circumstances such as reporting losses, analysts tend to
perform worse than average (Hwang et al., 1996; Hutton et al., 2012). Lastly, Donelson and Resutek (2015)
document that past volatility of earnings is negatively associated with its predictive power. Considering that
GPT only uses numerical financial information as its input, these results align well with Kim and Nikolaev (2023a,b), who show that contextual information becomes relatively more important when firms experience losses and their size is small.
GPT when a firm reports a loss and exhibits volatile earnings. These findings are the same
for all analyst forecast measures, as the magnitudes of the coefficients on Loss and Earnings Volatility in columns (2), (3), and (4) are consistently smaller than those in column (1).
Taken together, our results show that analysts and GPT both have difficulties in predicting
the earnings of small, loss-reporting firms. However, analysts tend to be relatively better
at dealing with these complex financial circumstances than GPT, possibly due to other soft
information and additional context (Costello et al., 2020).
Incremental Informativeness We next test whether analysts’ forecasts, despite lower
accuracy, add useful insights incremental to GPT’s predictions. We regress an indicator
I(Increase = 1), which equals one when subsequent period earnings increase and zero
otherwise, on the direction of future earnings predicted by GPT and/or analysts. Specifically,
we estimate the following OLS regression:
\[ I(\text{Increase} = 1)_{it} = \beta_1 \, Pred\_GPT_{it} + \beta_2 \, Pred\_Analyst_{it} + \delta_{year} + \delta_{ind} + \varepsilon_{it} \tag{3} \]

where $Pred\_X$ is an indicator that equals one when X (which is either “GPT” or “Analyst”) predicts an increase in earnings, and zero otherwise. $\delta_{year}$ and $\delta_{ind}$ are year and industry (SIC two-digit level) fixed effects. Standard errors are clustered at the industry level.
The results are presented in Table 3, Panel B. In column (1), we find that GPT’s predic-
tion, on a standalone basis, is positively associated with future outcomes while controlling
for industry and year-fixed effects. The same result holds for individual analysts’ forecasts
as can be seen in columns (2), (3), and (4). Consistent with the results in Table 2, analysts’
forecasts issued six months after the earnings release exhibit stronger associations with the
actual outcomes than the forecasts issued one month after the earnings release (the adjusted
R-squared in column (4) is 0.044, which is almost twice the adjusted R-squared value in
column (2)).
In columns (5), (6), and (7), we include both GPT and analyst forecasts simultaneously
in a single regression. Across all models, both coefficients are statistically significant. We
observe that the coefficient on GPT is largely unchanged (its t-statistic marginally decreases from 2.99 to 2.67) and the coefficient on analysts' predictions increases in magnitude when
both variables are used simultaneously (e.g., from 0.073 in column (2) to 0.110 in column (5)).
The adjusted R-squared value also increases from 0.070 in column (1) to 0.089 in column
(5). These results indicate that GPT and human analysts are complementary, corroborating
our results in Table 3.
Does GPT Do Well When Humans Struggle? To explore the relative advantage of
an LLM compared to human analysts, we examine instances when human analysts are likely
to struggle with accurately forecasting earnings. In particular, we identify instances where
analyst forecasts are likely to be biased or inefficient ex ante. We also consider instances in
which analysts tend to disagree about future earnings (exhibit dispersion).
To estimate ex-ante bias (inefficiency) in analysts’ forecasts, we run cross-sectional re-
gressions of analyst forecast errors on the same firm characteristics as in Equation 2. We
then take the absolute value of the fitted values from this regression.[18] Consistent with prior
literature, forecast errors are defined as the difference between actual EPS and forecasted
EPS, scaled by the stock price at the end of the last fiscal year. In addition to ex-ante bias,
we measure the disagreement in analysts’ forecasts. Specifically, we use the standard devia-
tion of analysts’ forecasted EPS values, scaled by the stock price at the end of the preceding
fiscal year.
We then partition the sample based on the quartile values of analyst bias and estimate
Equation 3 for each group. The results are presented in Panel C of Table 3. By comparing
the coefficients in columns (1) and (2), we observe important differences. When the analysts’
bias is expected to be relatively low, GPT’s predictions receive a smaller weight (compared
to that in column (2) when the bias is expected to be higher), and the coefficient on analysts’
predictions is relatively large. These differences are statistically significant at the 1% level.
They suggest that GPT is more valuable in situations when human analysts are likely to
be biased. Similar results follow in columns (3) and (4) when we partition the sample on
analyst disagreement: GPT’s prediction receives more weight when analysts’ disagreement
is high and vice versa.
Taken together, our results indicate that GPT’s forecasts add more value when human
biases or inefficiencies are likely to be present.
5 Comparison with Specialized ML Models
So far, we have shown that GPT’s predictions largely outperform human analysts. As
human analysts are known to have a systematic bias in their forecasts, we raise the bar and
turn to more sophisticated benchmarks, including state-of-the-art machine learning models.
[18] Note that errors should be unpredictable if forecasts are unbiased and efficient.
5.1 Methodology
Following Ou and Penman (1989) and Hunt et al. (2022), we focus on 59 financial vari-
ables obtained from the Compustat Annual database to predict future earnings but exclude
the price-to-earnings ratio for consistency reasons (stock price is not financial statement in-
formation). We perform two different prediction exercises: stepwise logistic regression and
ANN. In both cases, we use a rolling five-year training window. That is, we estimate (train) the model using data from years t−5 to t−1, and apply the trained model to the year t data to generate forecasts. By doing so, we ensure that the models do not learn from the
test data during the training phase. Since our sample spans from fiscal year 1962 to 2021,
we train 56 distinct models for each prediction method.
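The rolling scheme can be sketched as follows, where fit and predict are placeholders for either estimator.

```python
# A sketch of the rolling estimation scheme: for each test year t, train on
# fiscal years t-5 through t-1 and predict year t out of sample. `fit` and
# `predict` stand in for either the stepwise logit or the ANN.
def rolling_forecasts(df, fit, predict, first_year, last_year):
    predictions = {}
    for t in range(first_year, last_year + 1):
        train = df[(df["year"] >= t - 5) & (df["year"] <= t - 1)]
        test = df[df["year"] == t]
        model = fit(train)                 # estimated on past data only
        predictions[t] = predict(model, test)
    return predictions
```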
In the stepwise logistic regression, we follow Ou and Penman (1989) and only retain the
significant variables from the first step when performing the second step of the procedure.
The trained logistic regression then yields a probability value instead of a binary variable as
its output. We classify observations with a probability value higher than 0.5 as an increase
(and a decrease otherwise). In contrast to the logistic regression, the ANN model allows
for non-linearity among the predictors. Our model has an input layer with 59 neurons, two
hidden layers with 256 and 64 neurons each, and an output layer with two neurons (Kim
and Nikolaev, 2023a). The output layer produces a two-dimensional vector $(p_1, p_2)$, and we classify the outcome as an increase when $p_1 > p_2$ and vice versa. We use the Adam optimizer,
ReLU activation function, and cross-entropy loss. We use batch training with a batch size of
128. All input variables are standardized. Missing continuous variables are imputed as the
year-industry average. We apply an early stopping criterion with a patience of five epochs, meaning that the model stops training when there is no improvement in performance for five consecutive epochs.[19]
For each training phase, we assign a random 20% of the training
sample to the validation set and optimize the learning rate and dropout rate. Specifically,
we perform a grid search of nine iterations, using three learning rates (1e
5
, 1e
3
, and 1e
1
)
and three dropout rates (0, 0.2, and 0.4).
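To make the training setup concrete, below is a minimal sketch of an ANN matching the description above (59 standardized inputs, hidden layers of 256 and 64 neurons, two output neurons, ReLU, Adam, cross-entropy loss, batch size 128, early stopping with a patience of five epochs). It assumes PyTorch; the class and function names are ours, and the construction of the data loaders is omitted.

```python
# A minimal sketch of the ANN described above, assuming PyTorch.
# Names are illustrative, not the authors' actual code; train_loader and
# val_loader (59 standardized inputs, binary target) are assumed given.
import torch
import torch.nn as nn

class EarningsANN(nn.Module):
    def __init__(self, n_inputs: int = 59, dropout: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 2),  # two output neurons: (p1, p2)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def train_with_early_stopping(model, train_loader, val_loader,
                              lr=1e-3, patience=5, max_epochs=50):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    best_val, stale = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for xb, yb in train_loader:  # batch size 128 in our setup
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)
        if val < best_val:
            best_val, stale = val, 0
        else:
            stale += 1
            if stale >= patience:  # stop after five stale epochs
                break
    return model
```

A prediction is then classified as an increase when the first output exceeds the second, i.e., by taking the argmax over the two logits.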
5.2 Main Results
Overall Results We report the results in Table 4, Panel A, and Figure 3. Stepwise logistic
regressions following Ou and Penman (1989) achieve an accuracy of 52.94% and an F1 score
of 57.23%. We observe a considerably higher prediction accuracy using the ANN model.
The model achieves a 60.45% accuracy and an F1-score of 61.62%. This result highlights the
importance of non-linearities and interactions among financial variables for the predictive
ability of numerical information.
[19] The maximum allowed number of training epochs is set to 50, yet none of the models hit this limit.
Consistent with the results in the analyst sample, our CoT-based GPT predictions achieve
an accuracy of 60.31%, which is on par with the specialized ANN model. In fact, in terms
of the F1-score, GPT achieves a value of 63.45%, which is the highest among all prediction
methods. This indicates a remarkable aptitude of GPT for analyzing financial statements.[20]
Not only does it outperform human analysts, but it generates performance on par with the
narrowly specialized state-of-the-art ML applications.
We further examine the possibility that the difference between ANN and GPT performance is partly driven by slightly different input variables: we provide GPT with balance sheet and income statement variables, but the ANN with the 59 Ou and Penman (1989) ratios. Thus, to ensure that the results
are not an artifact of this choice, we also train an ANN model using the same balance sheet
and income statement variables. We scale balance sheet items by total assets and income
statement items by total sales. We also include change in revenue, change in lagged revenue,
change in total assets, and revenue scaled by total assets. The ANN model with financial
statement information achieves an accuracy of 60.12% and an F1-score of 61.30%, which are
slightly lower than those of GPT.
Time Trends We report the overall time trend of GPT’s and ANN’s prediction accuracy
in Figure 4 (detailed annual accuracy and F1-scores are reported in Appendix A). The left
panel shows a negative time trend in GPT's prediction accuracy. In terms of economic magnitude, GPT's accuracy has decreased, on average, by 0.1 percentage points per year, which translates into a decrease of 5.4 percentage points over the 54-year sample period.
Interestingly, we observe sharp drops in prediction accuracy in 1974, 2008-2009, and 2020.
These periods overlap with international macroeconomic downturns: the oil shock in 1974,
the financial crisis in 2008-09, and the Covid-19 outbreak in 2020. This result is comforting, as GPT should not be able to foresee unexpected, exogenous macroeconomic shocks if its performance is unrelated to memory.[21]
Most importantly, in the right panel of Figure 4, we plot the
time-series trend of the “difference” in the accuracy of GPT and ANN models. ANN models
exhibit time trends similar to GPT's, with their annual differences fluctuating close
to zero. Thus, for both evaluation metrics, we find a negative and statistically significant
time trend, implying that it has become increasingly difficult to predict future earnings using
only numeric information.[22]
[20] GPT outperforms stepwise logistic predictions at the 1% level. However, the difference between GPT and ANN performance is not statistically significant at conventional levels.
[21] We discuss this potential issue more formally in Section 6.1.
[22] This result corroborates Kim and Nikolaev (2023a), who find that the informational value of narrative context in predicting future earnings has increased over time.
Sources of Inaccuracy Next, we explore which firm characteristics are associated with
the likelihood of making incorrect earnings predictions. Column (1) of Table 4 focuses
on the accuracy of GPT’s predictions and is consistent with our findings for the analyst
sample (Table 3). We then report the determinants of the incorrect predictions by ANN
and logistic regression models in columns (2) and (3), respectively. Both the ANN and
logistic regression are also more likely to generate inaccurate predictions when firms are
smaller, have higher leverage, record a loss, and have higher earnings volatility. However,
interestingly, ANN is relatively more likely than GPT to make inaccurate predictions when
firms are smaller and record a loss. A one standard deviation decrease in firm size reduces
GPT’s prediction accuracy by 3.4 percentage points. In contrast, the same change in firm
size is associated with a 5.5 percentage point decrease in prediction accuracy for the ANN
model. The difference between the two coefficients is statistically significant at the 1% level.
Similarly, the coefficients on Loss and Earnings Volatility are statistically different at the
5% level. The differences between logistic regression and GPT predictions are even more
pronounced. These findings hint at the ability of GPT to make better predictions for less
common data patterns (e.g., loss-making firms), presumably due to its ability to rely on its
conceptual knowledge and theoretical understanding of business.
Incremental Informativeness While GPT’s performance is comparable to that of an
ANN, we also examine whether GPT conveys incremental information when compared to
specialized ML models. This analysis is reported in Panel C. In columns (1) to (3), we show
that across all models, predicted earnings changes, individually, are positively associated
with the actual changes. In column (4), when both GPT and ANN forecasts are included
simultaneously, both remain statistically significant and hence contain incremental informa-
tion. Interestingly, the coefficient on ANN becomes one-third in magnitude (compared to
column (2)) and its statistical significance deteriorates (from a t-statistic of 3.69 to 2.36),
whereas the coefficient on GPT remains stable. This result suggests that GPT captures some
additional dimensions of information than non-linear interactions among financial variables
when predicting future earnings, e.g., external theoretical knowledge.
5.3 Confidence, Magnitude, and Generalizability
5.3.1 LLM’s Confidence
Method We estimate the confidence of LLM’s answers based on two methods. First, we
explicitly instruct the model to report a confidence score on its earnings prediction, with one
being perfect confidence and zero being a pure guess (Bybee, 2023). Second, we compute
an alternative confidence score based on token-level log probability values, which we take directly from the probability vector provided by the model. Specifically, we average the log probability values across all output tokens to measure the overall certainty of the model's answer.
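For illustration, a confidence score of this kind can be computed directly from the token-level log probabilities returned by the API. The sketch below assumes the current openai-python client with logprobs enabled; the model name and function are our own choices, not necessarily the exact configuration we used.

```python
# Hedged sketch: averaging token-level log probabilities as a confidence
# proxy, assuming the openai-python client with logprobs enabled.
from openai import OpenAI

client = OpenAI()

def answer_with_confidence(prompt: str) -> tuple[str, float]:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # model name is an assumption
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    choice = resp.choices[0]
    logprobs = [tok.logprob for tok in choice.logprobs.content]
    # A less negative average log probability indicates a more certain answer.
    return choice.message.content, sum(logprobs) / len(logprobs)
```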
Results For both approaches, we report prediction results of the high confidence (fourth
quartile) and the low confidence (first quartile) groups. We present the results in Figure 5 and
columns (1) to (4) of Table 5. The model performs better when it reports greater confidence.
In the high confidence group, the model achieves an average accuracy of 62.44% (63.15%)
based on the reported confidence value (confidence score derived from logistic probabilities),
which is approximately 2.6 (4.6) percentage points higher than the corresponding accuracy
of the low confidence group. We find similar results based on the F1 score. Overall, this
result indicates that the model is capable of distinguishing instances where earnings are more predictable from those where they are less so.
5.3.2 Magnitude
Method Recall that we also instruct the model to provide the expected magnitude of
earnings change: “large”, “moderate”, or “small”. As in Ou and Penman (1989) and Hunt
et al. (2022), we expect the model to be more accurate in determining the directional change
when it predicts large rather than immaterial changes.
Results We present the results in Figure 5 and columns (5) and (6) of Table 5. We find that
the average accuracy is 62.03% when the model predicts large changes whereas it decreases
to 60.22% for small changes. We document a similar pattern for F1 scores: 61.16% for large
changes vs. 57.95% for small changes. Overall, when the model expects a larger change, its
directional predictions are more accurate.
5.3.3 LLM type
Method We also test whether the capabilities associated with a specific LLM type deter-
mine its predictive ability. In the main analysis, we use the most recent version of GPT,
GPT-4-turbo. We also experimented with a less powerful LLM version from the same fam-
ily, GPT-3.5-turbo, and otherwise used the same experimental settings. In addition, we also
explored another family of LLMs provided by Google, namely, Gemini Pro 1.5 (also with
the same experimental settings). Due to considerable processing time, we choose a random
20% sample for this set of analyses.
Results We present the results in Figure 5 and Table 5, columns (7) to (9). GPT 4
achieves the best performance, followed by Gemini 1.5 and GPT 3.5. Gemini 1.5 achieves
an overall accuracy of 59.15%, which is close to that of GPT 4 (61.05%) in the same 20%
sample. However, GPT 3.5 achieves an accuracy of only 52.29% and an F1-score of 59.17%,
both of which are substantially lower than our GPT 4 benchmarks. We also find that the outputs
of GPT 4 and Gemini 1.5 are largely overlapping with only 1,808 out of 30,135 firm-years
(approximately 6%) having opposing predictions. Overall, this analysis suggests that our
findings are not confined to a specific family of LLMs. Although the final prediction results
largely rely on the performance of the backbone language model, recent generations of LLMs
are capable of analyzing financial statements and making informed decisions.
6 Where Does an LLM’s Predictive Ability Come From?
In this section, we aim to understand the sources of GPT’s predictive ability. We explore
two broad explanations. The first explanation is that GPT’s performance comes from its
memory, e.g., due to the model’s ability to identify the company based on numeric data.
We aim to rule out this possibility as it undermines the integrity of the model’s predictions.
Another explanation is that the strength of the model is in its ability to generate narrative
insights based on its analysis of numeric data. We explore each of these possibilities next.
6.1 Is There a Look-ahead Bias in the Model?
An important concern with the reliance on a pre-trained large language model in a
prediction task is its potential for a look-ahead bias (e.g. Sarkar and Vafa, 2024). For
example, the model may have been trained on the company-specific financial data and,
hence, already may “know” the answer as to whether earnings increased or decreased in the
future (or have a general sense of how well the company did over time). Our research design
is relatively immune from this potential bias (e.g. Glasserman and Lin, 2023) because we use
a consistent anonymized format of financial statements across firms. This makes it virtually
impossible for the model to infer a firm’s identity from the structure of financial statements
or specific account names. We also ensure the statements do not contain any dates and use relative years, i.e., t or t−1. The latter mitigates the concern that the model has knowledge about macroeconomic trends in a specific year and uses it to predict future earnings. To
appreciate this issue, imagine that the model was able to match a given set of financials to
2007. In this case, the model could draw on its knowledge of the major economic downturn
in 2008 and adjust its prediction accordingly.
Even though the anonymous nature of financial statements should prevent the model from
“guessing” the entity, we perform two formal analyses to further rule out this concern.[23]
Can GPT Guess Firm Name and Year? In this set of tests, we instruct the model
to make guesses about the firm or year based on the financial statements that we provide.
Specifically, we ask the model to provide the ten most probable firm names and the most
probable fiscal year. Additionally, we force the model to produce outputs even when it
believes that it cannot make any informed guess.
For economic reasons, our first set of experiments does not include any chain-of-thought
prompts. We perform this experiment on 10,000 random observations. The results are
presented in Table 6, Panel A. We find that the model correctly identifies the firm name
with an accuracy of 0.07%, which is lower than the accuracy of a random guess from the
population of names in our data. In the left panel of Figure 6, we plot the ten most frequently
produced firm names. We find that the model almost always predicts the same set of ten
firms, including Tesla, Facebook, and Amazon. This result is consistent with the model’s
training objective to produce the most probable words (name in this case) conditional on its
information. Absent an informative prior, the model is likely to predict the most visible or
popular firms in its training corpus.
The accuracy of correctly guessing the year of the financial statements is 2.95%. In the right panel of Figure 6, we plot the actual fiscal year against GPT's prediction. We observe that almost all predictions are 2019, 2020, or 2021, independent of the actual year, which is inconsistent with the model's ability to guess the year.[24]
In the second set of experiments, we use the exact same chain-of-thought prompts as
in the main analysis, but then ask the model to guess the firm name and year (instead of
predicting earnings). We use a random sample of 500 observations. Panel B of Table 6
contains the results. The findings confirm very low accuracy and thus address a potential
concern that the CoT prompt is more capable of invoking the model’s memory. Taken
together, our results strongly suggest that the model cannot make a reasonable guess about
the entity or the fiscal year from the anonymous financial statements. Therefore, it is highly
unlikely that the model is inadvertently using its “memory” about financial information to
make earnings predictions.
[23] Compared to de-identified financial statement data, anonymizing textual data is conceptually more challenging. Textual data, such as earnings calls, may still retain sufficient contextual information that potentially allows the model to guess the anonymized firm.
[24] Refer to Appendix D for computing the accuracy of a random guess.

Analysis outside of GPT’s training window As suggested in Sarkar and Vafa (2024), the most effective way to rule out the model’s look-ahead bias is to perform a test outside of the model’s training window. OpenAI’s GPT4-Turbo preview was trained on data up to April 2023, thereby significantly limiting the scope to conduct this analysis. Nevertheless, we use financial statement data from fiscal year 2022 (released in January-March 2023) to predict earnings of fiscal year 2023 (released in early 2024).
We present the results in Table 6, Panel C. As a comparison, we also report prediction
results of the logistic regressions, analyst predictions, and ANN models. GPT achieves an
accuracy of 58.96% and an F1 score of 63.91%. The accuracy (but not the F1 score) is slightly
lower than the average reported in Table 4, Panel A. However, recall that we find an overall
decreasing time trend in GPT’s prediction accuracy. Specifically, as shown in Appendix
A, GPT's prediction accuracy is only 54.36% for fiscal year 2021 and 59.01% for 2019 (GPT's prediction accuracy plummets in 2020 during the Covid-19 outbreak). In fact, both
the out-of-GPT-sample accuracy and the F1 score are substantially higher than the average
over the last 10 years (58.01% and 59.15%). Therefore, we interpret our results as GPT’s
out-of-sample performance being closely in line with our “in-sample” results. Furthermore,
GPT achieves a very similar accuracy score out-of-sample as the ANN model (58.96% versus
59.10%) and an even higher F1 score (63.91% versus 61.13%) for the same year, which is
closely in line with our main findings. Taken together, this result corroborates our prior tests
and confirms that the model’s predictive ability does not stem from its training memory.
6.2 Are LLM-Generated Texts Informative?
Next, we explore whether the model’s predictive ability comes from its ability to generate
narrative insights about the financial health and future performance of the company, in
line with its objective to analyze financial statements. We leverage the fact that our CoT
prompts instructed the model to provide information besides the prediction itself: narrative
description and interpretations of trend and ratio analyses, as well as the rationale behind the
binary predictions. We start with descriptive analyses of the generated texts. Subsequently,
we evaluate the information content of texts generated by GPT.
Descriptive Bigram Analysis We begin with a descriptive approach, performing a con-
tent analysis of the texts generated by the model. This analysis involves counting the most
common bigrams in the ratio analysis and the most common monograms (single words) in
the rationale section. This method allows us to discern patterns and dominant themes that
may contribute to the model’s analytical performance.
We present the results in Figure 7. In the left panel, we report the top ten most fre-
quently used bigrams in the ratio analysis. We calculate the frequency by scaling the bigram
counts with the total number of bigrams generated by the model. We find that the model
most commonly refers to the operating margin. In addition to the profitability information,
the model also frequently computes efficiency (asset and inventory turnover) and liquidity
(current ratio, current assets, and current liability). The model’s rationale in making fi-
nal predictions is generally consistent with its bigram analysis. In its decision, the model
commonly refers to firm growth, liquidity, operating profitability, and efficiency. This con-
sistent alignment between the themes identified in the bigram analysis and the model’s final
predictions underscores the utility of LLM-generated texts in capturing essential financial
indicators.
Information Content of Generated Text We hypothesize that GPT is capable of
predicting future earnings because it distills narrative insights about the financial health
of the company from the numeric data. We thus examine whether GPT-generated texts
contain information that is useful for predicting the direction of future earnings. To do so,
we process each GPT output with a BERT-base-uncased model to obtain its 768-dimensional
vector representation (note that GPT does not allow retrieving native embeddings, and thus we use BERT).[25]
We then design a new ANN model that uses these textual embeddings as
inputs and train the ANN to predict the direction of future earnings (target variable). The
model has two hidden layers, with dimensions of 256 and 64, and an output layer with two dimensions, the probabilities of an earnings increase vs. decrease, $(p_1, p_2)$. We classify the outcome as an increase when $p_1 > p_2$ and vice versa. The model is otherwise analogous to the ANN models we estimated earlier.[26] We refer to this model as the embeddings-based model.
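For concreteness, the embedding step can be sketched as follows, assuming the Hugging Face transformers library; the chunk-and-average logic follows footnote 25, and the function name is ours.

```python
# Hedged sketch: 768-dim [CLS] embedding of a GPT-generated narrative,
# averaging over chunks when the text exceeds BERT's 512-token limit.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed_narrative(text: str, max_len: int = 512) -> torch.Tensor:
    # Tokenize without special tokens; chunks leave room for [CLS]/[SEP].
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_len - 2
    chunks = [ids[i:i + step] for i in range(0, len(ids), step)] or [[]]
    cls_vectors = []
    for chunk in chunks:
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]]
        )
        out = bert(input_ids=input_ids)
        cls_vectors.append(out.last_hidden_state[0, 0])  # [CLS] vector
    # Average chunk-level [CLS] vectors into one 768-dim representation.
    return torch.stack(cls_vectors).mean(dim=0)
```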
We report the accuracy, F1-score, and the area under the ROC curve (AUC) of the
trained model in Table 7, Panel B (note that we were not able to measure AUC for GPT
forecasts and thus did not report it previously). Our embedding model achieves an accuracy
of 58.95%, an F1-score of 65.26%, and an AUC of 64.22%. It is noteworthy that this model
achieves the highest F1 score among all classification methods we examined previously. For
comparison purposes, the second row of the table repeats the results of the ANN model based
on variables from the two financial statements, which was previously reported in Table 4.
This model achieves only a somewhat higher accuracy of 60.12%, but a considerably lower
F1-score (61.30%) and AUC (59.13%). Overall, our results indicate that narrative text
generated by GPT contains a significant amount of information useful in predicting future earnings, i.e., it indeed represents narrative insights derived from the numeric data based on the CoT prompt. This result suggests that these narrative insights serve as the basis for GPT's superior predictive ability. In untabulated results, we find that GPT forecasts and the embeddings-based forecasts of the future earnings direction have a correlation of 94%, which suggests that both rely largely on the same information set.
[25] We use the last hidden state vector of the CLS token associated with a given narrative. In case the narrative exceeds 512 tokens, we partition the text into chunks and take the average over chunk-specific vectors.
[26] A ReLU activation function is used for the first two layers and the sigmoid function is used in the last layer. We minimize cross-entropy loss and use the Adam optimizer. As in our main ANN model, we use rolling five-year windows to train the model. We use a batch size of 128, and the model stops training when there is no improvement for five consecutive training epochs. We perform a grid search over the nine combinations of three learning rates (1e-5, 1e-3, and 1e-1) and three dropout rates (0, 0.2, and 0.4), on a random 20% of the training sample.
As additional analyses, we experiment with different ANN specifications by changing the
input vectors. First, following Kim and Nikolaev (2023a), we include both textual vectors
(GPT insights) and numeric data (scaled variables from financial statements) into the model,
allowing for full non-linear interactions among the two inputs. This model is reported in row
(3) of the table. We find that the dual-input model achieves the highest accuracy metrics:
accuracy of 63.16%, an F1-score of 66.33%, and an AUC of 65.90%. This result reconciles
with our prior evidence that GPT forecasts have incremental information beyond numeric
inputs and also highlights the value of considering the narrative insights generated by an
LLM when interpreting numerical information.
Finally, we examine the relative importance of different parts of the financial statement
analysis performed by GPT. Specifically, the model analyzes trends, then switches to the
ratio analysis, and concludes by providing a rationale behind its prediction. We obtain em-
bedding vectors for each of the three types of generated narratives with the goal of assessing
their relative importance. Specifically, we estimate three ANN models each of which leaves
out one type of embedding vectors from the analysis. The ANN model that omits trend anal-
ysis exhibits an accuracy of 57.11%, which is approximately 1.8 percentage points lower than
that of the ANN model that uses the entire text embedding. The ANN model, excluding
ratio analysis, achieves an accuracy of 55.65%, which is almost 3.3 percentage points lower
than that of the full ANN model. These results indicate that the ratio analysis adds the most informational value and the trend analysis the second most when determining the company's future direction. In contrast, excluding the rationale narrative
does not change the model performance substantially (58.88%), implying that the rationale
does not add information beyond the trend and ratio analyses.
7 Asset Pricing Tests
Having demonstrated that GPT's predictions of the earnings direction have high accuracy and stem from the model's ability to generate insights rather than from its memory, we now investigate the practical value of an LLM-based financial statement analysis by evaluating
trading strategies based on GPT’s output.
In particular, signals that are informative about future expected profits should exhibit
a positive association with expected stock returns in the cross-section of firms (Fama and
French, 2015). Asset pricing models typically use the current level of profitability as a proxy for expected future profitability (Novy-Marx, 2013). To the extent that GPT forecasts have incremental information about future profitability, they should also predict future
stock returns. We use GPT forecasts of whether earnings are likely to increase or decrease
in the subsequent period, to form an investment strategy and evaluate its performance.
7.1 Methodology
Because our sample includes firms with December 31 fiscal year-end, their financial results
are released by the end of March. Following prior literature, we allow approximately three
months for the market to fully process the reported information and form portfolios on June
30 of each year. We hold the portfolio for one year and measure their Sharpe ratios and
monthly alphas. We compare three types of strategies. The first strategy sorts stocks into
portfolios based on GPT forecasts, and the other two perform sorts based on ANN and
logistic regression forecasts that rely on numeric information.
ANN and Logistic Regressions ANN and logistic regressions yield probabilities that
earnings will increase in the subsequent year. We use these predicted probabilities to sort
the stocks into ten portfolios. Then, on June 30 of each year, we take long positions in the top
decile stocks and short stocks in the bottom decile.
GPT Because GPT does not provide probabilities that earnings will increase or decrease,
we follow a different approach to form portfolios. We rely on three pieces of information:
binary directional prediction, magnitude prediction, and average log probability of tokens.
In particular, for each fiscal year, we select stocks predicted to experience an “increase”
in earnings with a predicted magnitude (of the change in earnings) of either “moderate” or “large”. We then sort those stocks on the average log probability values associated with
the generated text. This allows us to choose stocks with relatively more confident forecasts
(recall that model answers with high certainty are more accurate than those with low
certainty). We then retain stocks with the highest log probabilities such that the number of
firms retained each year constitutes 10% of our sample in that year (our goal is to construct
an equivalent to a decile portfolio). We also do the same for the stocks predicted to experience
a “decrease” in earnings. We filter stocks with a predicted magnitude of either “moderate”
or “large”, and sort them on log probability values. We then short the same number of stocks
as that in the long portfolio, i.e., retain 10% from the total number of observations in that
year with the highest expected confidence. By doing so, we match the number of stocks to
the number of stocks included in ANN or logit-based portfolios.
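A minimal sketch of this selection rule, assuming the predictions sit in a pandas DataFrame with columns year, direction, magnitude, and avg_logprob (our names), is:

```python
# Hedged sketch of the GPT-based long-short selection described above.
# Column names are illustrative; df holds one row per firm-year.
import pandas as pd

def form_gpt_portfolio(df: pd.DataFrame) -> pd.DataFrame:
    picks = []
    for _, grp in df.groupby("year"):
        n = max(1, int(0.10 * len(grp)))  # decile-equivalent count
        material = grp[grp["magnitude"].isin(["moderate", "large"])]
        longs = (material[material["direction"] == "increase"]
                 .nlargest(n, "avg_logprob").assign(position="long"))
        shorts = (material[material["direction"] == "decrease"]
                  .nlargest(n, "avg_logprob").assign(position="short"))
        picks.extend([longs, shorts])
    return pd.concat(picks, ignore_index=True)
```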
7.2 Results
Sharpe Ratios To compute Sharpe ratios, we form equal-weighted and value-weighted
portfolios. For value-weighted portfolios, we rebalance the portfolio weights each month.
Although value-weighted portfolios are less sensitive to small market capitalizations, it is
difficult to rebalance the portfolios based on the stocks’ time-varying market caps in practice
(Jiang et al., 2022). Recall that our prior findings suggest that GPT appears to have an
advantage in analyzing smaller and relatively more volatile companies. We thus present the
outcome of both the value- and equal-weighted strategies.
The results are presented in Table 8, Panel A. We find that equal-weighted portfolios
based on GPT predictions achieve a Sharpe ratio of 3.36, which is substantially larger than
the Sharpe ratio of ANN-based portfolios (2.54) or logistic regression-based portfolios (2.05).
In contrast, for value-weighted portfolios, we observe that ANN performs relatively better
(Sharpe = 1.79) than GPT (1.47). Both dominate the logistic regressions (0.81).[27] This
result is consistent with our finding in Table 4 that both GPT and ANN contain incremental
information and are thus complementary. Overall, this analysis shows potential for using
GPT-based financial statement analysis to derive profitable trading strategies.
[27] Accounting for 10 basis points in transaction costs, the GPT-based equal-weighted portfolio yields a Sharpe ratio of 2.84. The value-weighted portfolio yields a Sharpe ratio of 0.95.
Alphas Next, we compute monthly alphas for each of the three investment strategies
described above based on five different factor models, from CAPM to Fama and French
(2015)’s five factors plus momentum. We present the results in Table 8, Panel B.
Consistent with the results in Panel A, equal-weighted portfolios generate higher alphas
in general. As expected, we observe a significant reduction in alphas when we include the
profitability factor, another proxy for future profitability, in column (4) (from 1.29 to 0.97 for portfolios based on GPT predictions). However, even after controlling for five factors
and momentum, portfolios based on GPT’s predictions generate a monthly alpha of 84 basis
points (column (5)), or 10% annually. Portfolios based on ANN and logistic regression
estimates also generate positive alphas. However, their magnitudes and economic significance
are smaller (60 basis points with a t-statistic of 1.89 for ANN and 43 basis points with a
t-statistic of 1.96 for logistic regressions).
In Figure 8, we plot the cumulative log returns of portfolios based on GPT’s predictions
from 1968 to 2021. The left panel shows the cumulative log returns for equal-weighted long
and short portfolios separately. As expected, the long portfolio substantially outperforms
the short portfolio. In the right panel, we plot the cumulative log returns for the long-short
portfolio and compare them with the log market portfolio returns (dotted line). Notably,
our long-short portfolio consistently outperforms the market portfolio even when the market
experiences negative cumulative returns.
For value-weighted portfolios, consistent with the Sharpe ratio results, ANN-based portfolios perform better than GPT-based ones, with a 50 basis point alpha even after controlling for the five factors and momentum. Portfolios based on GPT's predictions achieve a 37 basis point alpha with a t-statistic of 2.43 (column (10)). Portfolios based on logit estimates also exhibit
positive alphas (31 basis points) though they are marginally insignificant (t-statistic = 1.55).
Overall, our analysis demonstrates the value of GPT-based fundamental analysis in stock
markets. We also note that GPT's stronger (weaker) performance relative to the ANN when evaluated on equal-weighted (value-weighted) strategies is intriguing and points to GPT's ability to uncover value in smaller stocks.
8 Conclusion
In this paper, we probe the limits of large language models by providing novel evidence
on their ability to analyze financial statements. Financial statement analysis is a traditional
quantitative task that requires critical thinking, reasoning, and judgment. Our approach
involves providing the model with structured and anonymized financial statements and a
sophisticated chain-of-thought prompt that mimics how human analysts process financial
information. We specifically do not provide any narrative information.
Our results suggest that GPT’s analysis yields useful insights about the company, which
enable the model to outperform professional human analysts in predicting the direction of
future earnings. We also document that GPT and human analysts are complements rather than substitutes. Specifically, language models have a larger advantage over human analysts when analysts are expected to exhibit bias and disagreement, suggesting that AI models can best assist humans where humans underperform. Humans, on the other hand, add value when additional context that is not available to the model is likely to be important.
Furthermore, and surprisingly, GPT's performance is on par with (or, in some cases, even better than) that of the most sophisticated, narrowly specialized machine learning models, namely an ANN trained on earnings prediction tasks. We investigate potential sources of the LLM's
superior predictive power. We first rule out that the model’s performance stems from its
memory. Instead, our analysis suggests that the model draws its inference by gleaning use-
ful insights from its analysis of trends and financial ratios and by leveraging its theoretical
understanding and economic reasoning. Notably, the narrative financial statement analy-
sis generated by the language model has substantial informational value in its own right.
Building on these findings, we also present a profitable trading strategy based on GPT’s
predictions. The strategy yields higher Sharpe ratios and alphas than other trading strate-
gies based on ML models. Overall, our analysis suggests that GPT shows a remarkable
aptitude for financial statement analysis and achieves state-of-the-art performance without
any specialized training.
Although one must interpret our results with caution, we provide evidence consistent
with large language models having human-like capabilities in the financial domain. General-
purpose language models successfully perform a task that typically requires human expertise
and judgment and do so based on data exclusively from the numeric domain. Therefore, our
findings indicate the potential for LLMs to democratize financial information processing and
should be of interest to investors and regulators. For example, our results suggest that
generative AI is not merely a tool that can assist investors (e.g., in summarizing financial
statements, Kim et al., 2023b), but can play a more active role in making informed decisions.
This finding is significant, as unsophisticated investors might be prone to ignoring relevant
signals (e.g., Blankespoor et al., 2019), even if they are generated by advanced AI tools.
However, whether AI can substantially improve human decision-making in financial markets in practice remains to be seen. We leave this question for future research. Finally, even though
we strive to understand the sources of model predictions, it is empirically difficult to pinpoint
how and why the model performs well.
References
Abarbanell, J.S., Bernard, V.L., 1992. Tests of analysts’ overreaction/underreaction to earnings
information as an explanation for anomalous stock price behavior. The Journal of Finance 47,
1181–1207.
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al., 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Ashbaugh-Skaife, H., Collins, D.W., Kinney Jr, W.R., 2007. The discovery and reporting of internal control deficiencies prior to SOX-mandated audits. Journal of Accounting and Economics 44, 166–192.
Ball, R., Foster, G., 1982. Corporate financial reporting: A methodological review of empirical
research. Journal of Accounting Research , 161–234.
Ball, R., Kothari, S., Robin, A., 2000. The effect of international institutional factors on properties
of accounting earnings. Journal of Accounting and Economics 29, 1–51.
Ball, R., Sadka, G., Tseng, A., 2022. Using accounting earnings and aggregate economic indicators
to estimate firm-level systematic risk. Review of Accounting Studies 27, 607–646.
Basu, S., Markov, S., 2004. Loss function assumptions in rational expectations tests on financial
analysts’ earnings forecasts. Journal of Accounting and Economics 38, 171–203.
Bernard, D., Blankespoor, E., de Kok, T., Toynbee, S., 2023. Confused readers: A modular measure
of business complexity. University of Washington Working Paper.
Blankespoor, E., Dehaan, E., Wertz, J., Zhu, C., 2019. Why do individual investors disregard
accounting information? the roles of information awareness and acquisition costs. Journal of
Accounting Research 57, 53–84.
Bochkay, K., Levine, C.B., 2019. Using MD&A to improve earnings forecasts. Journal of Accounting, Auditing & Finance 34, 458–482.
Bordalo, P., Gennaioli, N., Porta, R., Shleifer, A., 2019. Diagnostic expectations and stock returns.
The Journal of Finance 74, 2839–2874.
Bouwman, M.J., Frishkoff, P.A., Frishkoff, P., 1987. How do financial analysts make decisions? a
process model of the investment screening decision. Accounting, Organizations and Society 12,
1–29.
Bradshaw, M.T., 2011. Analysts’ forecasts: what do we know after decades of work? Working
Paper .
Bradshaw, M.T., Drake, M.S., Myers, J.N., Myers, L.A., 2012. A re-examination of analysts’
superiority over time-series forecasts of annual earnings. Review of Accounting Studies 17, 944–
968.
Brown, L.D., Call, A.C., Clement, M.B., Sharp, N.Y., 2015. Inside the “black box” of sell-side
financial analysts. Journal of Accounting Research 53, 1–47.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam,
P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot learners. Advances in neural
information processing systems 33, 1877–1901.
Bybee, J.L., 2023. The ghost in the machine: Generating beliefs with large language models. arXiv
preprint arXiv:2305.02823 .
Cao, S., Jiang, W., Wang, J.L., Yang, B., 2024. From man vs. machine to man + machine: The art and AI of stock analyses. Journal of Financial Economics, forthcoming.
Chen, X., Cho, Y.H., Dou, Y., Lev, B., 2022. Predicting future earnings changes using machine learning and detailed financial data. Journal of Accounting Research 60, 467–515.
Choi, G.Y., Kim, A., 2023. Economic footprints of tax audits: A generative AI-driven approach. Chicago Booth Research Paper.
Choi, J.H., Hickman, K.E., Monahan, A.B., Schwarcz, D., 2022. ChatGPT goes to law school. Journal of Legal Education 71, 387.
Costello, A.M., Down, A.K., Mehta, M.N., 2020. Machine + man: A field experiment on the role of discretion in augmenting AI-based lending models. Journal of Accounting and Economics 70, 101360.
De Bondt, W.F., Thaler, R.H., 1990. Do security analysts overreact? The American Economic
Review , 52–57.
Dell’Acqua, F., McFowland, E., Mollick, E.R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F., Lakhani, K.R., 2023. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. HBS Working Paper.
Donelson, D.C., Resutek, R.J., 2015. The predictive qualities of earnings volatility and earnings
uncertainty. Review of Accounting Studies 20, 470–500.
Eulerich, M., Sanatizadeh, A., Vakilzadeh, H., Wood, D.A., 2023. Can artificial intelligence pass accounting certification exams? ChatGPT: CPA, CMA, CIA, and EA.
Fama, E.F., French, K.R., 2015. A five-factor asset pricing model. Journal of Financial Economics
116, 1–22.
Ge, W., McVay, S., 2005. The disclosure of material weaknesses in internal control after the
sarbanes-oxley act. Accounting Horizons 19, 137–158.
Glasserman, P., Lin, C., 2023. Assessing look-ahead bias in stock return predictions generated by GPT sentiment analysis. arXiv preprint arXiv:2309.17322.
Graham, B., Dodd, D.L., 1934. Security Analysis. Whittlesey House, McGraw-Hill Book Co., New
York.
Green, J., Hand, J.R., Zhang, X.F., 2016. Errors and questionable judgments in analysts’ DCF models. Review of Accounting Studies 21, 596–632.
Groysberg, B., Healy, P.M., Maber, D.A., 2011. What drives sell-side analyst compensation at
high-status investment banks? Journal of Accounting Research 49, 969–1000.
Hansen, A.L., Kazinnik, S., 2023. Can ChatGPT decipher Fedspeak? Available at SSRN.
Hunt, J.O., Myers, J.N., Myers, L.A., 2022. Improving earnings predictions and abnormal returns
with machine learning. Accounting Horizons 36, 131–149.
Hutton, A.P., Lee, L.F., Shu, S.Z., 2012. Do managers always know better? the relative accuracy
of management and analyst forecasts. Journal of Accounting Research 50, 1217–1244.
Hwang, L.S., Jan, C.L., Basu, S., 1996. Loss firms and analysts’ earnings forecast errors. The
Journal of Financial Statement Analysis 1.
Jiang, J., Kelly, B.T., Xiu, D., 2022. Expected returns and large language models. Available at
SSRN .
Kahneman, D., 2011. Thinking, Fast and Slow. Macmillan.
Kim, A., Muhn, M., Nikolaev, V., 2023a. From transcripts to insights: Uncovering corporate risks using generative AI. arXiv preprint arXiv:2310.17721.
Kim, A., Muhn, M., Nikolaev, V.V., 2023b. Bloated disclosures: Can ChatGPT help investors process information? Chicago Booth Research Paper.
Kim, A., Nikolaev, V.V., 2023a. Context-based interpretation of financial information. Chicago Booth Research Paper.
Kim, A.G., Nikolaev, V.V., 2023b. Profitability context and the cross-section of stock returns.
Chicago Booth Research Paper .
Kothari, S., So, E., Verdi, R., 2016. Analysts’ forecasts and asset pricing: A survey. Annual Review
of Financial Economics 8, 197–219.
Li, E.X., Tu, Z., Zhou, D., 2023. The promise and peril of generative AI: Evidence from ChatGPT as sell-side analysts. Available at SSRN 4480947.
Liu, M., 2022. Assessing human information processing in lending decisions: A machine learning approach. Journal of Accounting Research 60, 607–651.
Lopez-Lira, A., Tang, Y., 2023. Can ChatGPT forecast stock price movements? Return predictability and large language models. arXiv preprint arXiv:2304.07619.
Monahan, S.J., et al., 2018. Financial statement analysis and earnings forecasting. Foundations
and Trends® in Accounting 12, 105–215.
Novy-Marx, R., 2013. The other side of value: The gross profitability premium. Journal of Financial Economics 108, 1–28.
Ou, J.A., Penman, S.H., 1989. Financial statement analysis and the prediction of stock returns.
Journal of Accounting and Economics 11, 295–329.
Penman, S.H., 2001. On comparing cash flow and accrual accounting models for use in equity valuation: A response to Lundholm and O’Keefe (CAR, Summer 2001). Contemporary Accounting Research 18, 681–692.
Penman, S.H., Sougiannis, T., 1998. A comparison of dividend, cash flow, and earnings approaches to equity valuation. Contemporary Accounting Research 15, 343–383.
Piotroski, J.D., 2000. Value investing: The use of historical financial statement information to
separate winners from losers. Journal of Accounting Research , 1–41.
Sarkar, S.K., Vafa, K., 2024. Lookahead bias in pretrained language models. Available at SSRN .
Sloan, R.G., 2019. Fundamental analysis redux. The Accounting Review 94, 363–377.
Stickel, S.E., 1991. Common stock returns surrounding earnings forecast revisions: More puzzling
evidence. The Accounting Review , 402–416.
Watts, R.L., Zimmerman, J.L., 1986. Positive Accounting Theory. Prentice-Hall.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al., 2022.
Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural
Information Processing Systems 35, 24824–24837.
Appendix A. Time Series of GPT’s Prediction Accuracy
This table shows time-series prediction accuracy and F1 scores of GPT and ANN. The last two columns are the differences
between the two models (GPT - ANN). Time trend is obtained by regressing accuracy metrics on fiscal years, obtaining robust
standard errors at the year level. *, **, and *** denote statistical significance at 10%, 5%, and 1% levels, respectively.
GPT ANN Diff
Fiscal Year Accuracy F1 Accuracy F1 Accuracy F1
1968 58.55% 67.19% 58.48% 67.45% 0.07% -0.26%
1969 59.23% 59.85% 58.71% 59.32% 0.52% 0.53%
1970 55.51% 58.86% 55.27% 58.66% 0.24% 0.20%
1971 60.29% 70.33% 59.73% 69.89% 0.56% 0.44%
1972 72.96% 81.58% 71.26% 80.62% 1.70% 0.96%
1973 67.53% 74.97% 66.84% 74.60% 0.69% 0.37%
1974 57.32% 63.10% 55.93% 61.93% 1.39% 1.17%
1975 58.29% 67.21% 57.93% 66.89% 0.36% 0.32%
1976 68.31% 77.63% 68.00% 77.42% 0.31% 0.21%
1977 69.14% 78.30% 68.64% 77.96% 0.50% 0.34%
1978 69.84% 78.73% 69.26% 78.27% 0.58% 0.46%
1979 61.90% 67.70% 60.96% 66.84% 0.94% 0.86%
1980 61.98% 64.72% 61.04% 63.97% 0.94% 0.75%
1981 61.00% 58.45% 59.78% 57.37% 1.22% 1.08%
1982 58.89% 58.04% 57.69% 56.75% 1.20% 1.29%
1983 66.32% 71.88% 64.96% 70.93% 1.36% 0.95%
1984 57.82% 59.54% 56.45% 58.14% 1.37% 1.40%
1985 58.19% 53.48% 56.44% 51.33% 1.75% 2.15%
1986 60.45% 60.82% 58.92% 59.48% 1.53% 1.34%
1987 63.85% 69.07% 62.45% 67.90% 1.40% 1.17%
1988 61.01% 63.50% 59.84% 62.32% 1.17% 1.18%
1989 60.36% 59.25% 59.39% 58.30% 0.97% 0.95%
1990 59.16% 57.50% 58.78% 57.08% 0.38% 0.42%
1991 60.04% 58.91% 59.63% 58.29% 0.41% 0.62%
1992 58.64% 61.49% 58.09% 60.75% 0.55% 0.74%
1993 66.79% 70.03% 62.25% 66.47% 4.54% 3.56%
1994 64.58% 69.70% 63.48% 68.69% 1.10% 1.01%
1995 60.84% 65.53% 59.87% 64.13% 0.97% 1.40%
1996 63.96% 67.09% 63.00% 65.88% 0.96% 1.21%
1997 56.38% 56.57% 55.40% 55.04% 0.98% 1.53%
1998 56.24% 58.79% 54.46% 56.38% 1.78% 2.41%
1999 62.39% 64.27% 61.08% 62.82% 1.31% 1.45%
2000 55.27% 51.77% 54.22% 50.66% 1.05% 1.11%
2001 56.65% 54.92% 56.02% 54.38% 0.63% 0.54%
2002 56.51% 62.80% 55.68% 62.24% 0.83% 0.56%
2003 59.94% 66.97% 59.92% 67.22% 0.02% -0.25%
2004 60.59% 67.68% 60.13% 67.54% 0.46% 0.14%
2005 60.17% 65.35% 59.69% 64.76% 0.48% 0.59%
2006 61.36% 63.24% 60.69% 62.46% 0.67% 0.78%
2007 60.73% 52.21% 60.27% 52.32% 0.46% -0.11%
2008 51.06% 43.49% 50.62% 43.33% 0.44% 0.16%
2009 48.86% 51.61% 48.57% 51.76% 0.29% -0.15%
2010 59.28% 66.29% 58.68% 65.98% 0.60% 0.31%
2011 57.52% 62.11% 57.36% 62.02% 0.16% 0.09%
2012 60.09% 63.62% 59.36% 62.49% 0.73% 1.13%
2013 60.81% 63.28% 60.21% 62.49% 0.60% 0.79%
2014 59.54% 59.75% 59.03% 59.04% 0.51% 0.71%
2015 59.76% 60.33% 59.18% 59.09% 0.58% 1.24%
2016 58.49% 59.44% 57.91% 58.77% 0.58% 0.67%
2017 59.35% 64.70% 58.41% 64.15% 0.94% 0.55%
2018 58.40% 60.31% 57.75% 59.52% 0.65% 0.79%
2019 59.01% 49.73% 58.35% 49.50% 0.66% 0.23%
2020 50.25% 55.24% 49.72% 54.41% 0.53% 0.83%
2021 54.36% 55.13% 53.54% 59.84% 0.82% -4.71%
Trend -0.001*** -0.002*** -0.001*** -0.002*** -0.000 -0.000
(-3.17) (-3.95) (-3.13) (-3.71) (-1.49) (-1.12)
Appendix B. Example Balance Sheet and Income Statement
Panel A and Panel B show an example of standardized, anonymous balance sheet and income statement. We use Compustat’s
balancing formula and delete fiscal years.
Panel A. Balance Sheet
Panel B. Income Statement
Appendix C. Example Output
We present one example output by GPT. GPT has rendered a prediction of “increase” with a moderate magnitude and a prediction certainty of 0.7. The correct answer is “increase”. Panel A shows the trend analysis results, Panel B shows the ratio analysis results, and Panel C shows the rationale.
Panel A. Trend Analysis
Panel B. Ratio Analysis
Panel C. Rationale
Appendix D. GPT’s Guess About Fiscal Years
In Table 6, we show that the accuracy of GPT’s fiscal year guesses is 2.95%. Our sample
spans the period 1968–2021, and one might be concerned that a pure random guess leads to a probability of 1.85%, which is lower than GPT's accuracy. However, given the distribution of GPT's answers and the distribution of fiscal years in our universe, this is not the case.
We observe that out of 10,000 random samples, GPT’s fiscal year predictions only
give years 2001 (0.02%), 2008 (0.47%), 2018 (0.02%), 2019 (3.50%), 2020 (32.60%), 2021
(63.31%), and 2023 (0.09%). GPT’s inability to produce balanced guesses already suggests
that it cannot make informed guesses about fiscal years. However, to test this more formally,
assume that we randomly draw a sample from the universe and that its fiscal year is $i$. When $i$ is not one of 2001, 2008, 2018, 2019, 2020, 2021, and 2023, the probability that GPT will guess the correct year is zero. If $i$ is 2001, the probability that GPT will guess the correct year is 0.02%, and if $i$ is 2021, the probability increases to 63.31%. Now define $p_i$ as the probability that GPT guesses the correct fiscal year given year $i$.
One more consideration is that our universe is not a balanced panel. Our data are sparse in earlier years and denser in recent years. Fiscal year 2021, for instance, accounts for 3.5% of the total observations. Let $q_i$ be the proportion of fiscal year $i$ in the entire sample. Then, the expected probability that a random draw from the population leads to a correct guess by GPT is

$$\text{Prob} = \sum_{i=1968}^{2021} p_i \, q_i.$$

This value is 3.3%, which is higher than 2.95%, the value we report in Table 6.
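As a sanity check, the expectation can be computed directly from the two distributions. The sketch below uses the answer shares reported above for $p_i$; the year proportions $q_i$ come from our sample, and, other than the 3.5% share of fiscal year 2021 stated in the text, the values shown are placeholders for illustration.

```python
# Sketch of the Appendix D calculation: Prob = sum_i p_i * q_i.
# p: GPT's answer shares from the 10,000-draw experiment above.
# q: each fiscal year's share of sample observations; only 2021 (3.5%)
#    is stated in the text, so the other entries are placeholders.
p = {2001: 0.0002, 2008: 0.0047, 2018: 0.0002, 2019: 0.0350,
     2020: 0.3260, 2021: 0.6331, 2023: 0.0009}
q = {2019: 0.030, 2020: 0.032, 2021: 0.035}
prob = sum(p.get(y, 0.0) * q.get(y, 0.0) for y in range(1968, 2022))
print(f"Expected accuracy of GPT's year guesses: {prob:.2%}")
```

A year outside the sample (e.g., 2023) can never match, so it contributes nothing to the sum.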
Figure 1. GPT Processing Details
This figure illustrates the structure of our experiment. Using raw data from Compustat annual, we construct standardized
balance sheet and income statement using Compustat’s balancing formulae. Then, we substitute fiscal years with relative fiscal
years t, t−1, and t−2. We then provide these anonymous, standardized financial statements to GPT 4 Turbo with detailed
chain-of-thought prompts. The model is instructed to provide notable trends, financial ratios, and their interpretations. Final
predictions are binary (increase or decrease) along with a paragraph of rationale. We also instruct the model to produce the
predicted magnitude of earnings change and the confidence of its answers.
Figure 2. GPT vs. Human Analysts
This figure compares the prediction performance of GPT and human analysts. Random Walk is based on the current earnings
change compared to the previous earnings. Analyst 1m (3m, 6m) denotes the median analyst forecast issued one (three, six)
month(s) after the earnings release. GPT (without CoT) denotes GPT’s predictions without any chain-of-thought prompts.
We simply provide the model with structured and anonymous financial statement information. GPT (with CoT) denotes the
model with financial statement information and detailed chain-of-thought prompts. We report average accuracy (the percentage
of correct predictions out of total predictions) for each method (left) and F1 score (right). We obtain bootstrapped standard
errors by randomly sampling 1,000 observations 1,000 times and include 95% confidence intervals.
Figure 3. GPT vs. Machine Learning Models
This figure compares the prediction performance of GPT and quantitative models based on machine learning. Stepwise Logistic
follows Ou and Penman (1989)’s structure with their 59 financial predictors. ANN is a three-layer artificial neural network model
using the same set of variables as in Ou and Penman (1989). GPT (with CoT) provides the model with financial statement
information and detailed chain-of-thought prompts. We report average accuracy (the percentage of correct predictions out of
total predictions) for each method (left) and F1 score (right). We obtain bootstrapped standard errors by randomly sampling
1,000 observations 1,000 times and include 95% confidence intervals.
Figure 4. Time Trend in Prediction Accuracy
This figure illustrates the time trend in GPT’s prediction accuracy (left) and the difference in GPT and ANN’s prediction
accuracy (right). Left panel demonstrates annual accuracy of GPT’s predictions. The dotted line represents fitted time-trend.
In right panel, we compute the difference between GPT’s and ANN’s prediction accuracy each year (GPT’s accuracy - ANN’s
accuracy). The dotted line represents fitted time-trend.
Figure 5. Different GPT Specifications
This figure compares the model performance depending on several experimental settings. The first four bars are based on GPT’s
answers on its confidence and the averaged token-level log probabilities. The fifth and sixth bars are based on the predicted magnitude of the earnings change. The last two columns compare the prediction accuracy of GPT 4 and GPT 3.5. We use a random 20% sample
for the last two columns. We obtain bootstrapped standard errors by randomly sampling 1,000 observations 1,000 times and
include 95% confidence intervals.
Figure 6. GPT’s Memory
This figure shows the experiment results to test GPT’s memory. We ask GPT to produce ten most probable company names
and the most probable fiscal year from the standardized, anonymous financial statements. Left panel shows the ten most
frequent company names in GPT’s answers, and right panel plots the actual fiscal years (vertical axis) and predicted fiscal years
(horizontal axis).
Figure 7. Sources of Prediction
This figure shows descriptive bigram (monogram) frequency counts of GPT answers. Left panel shows the ten most frequently
used bigrams in GPT’s answers on the financial ratio analysis. Right panel shows the ten most frequently used monograms in
GPT’s answers on the rationale behind its binary earnings prediction.
Figure 8. Equal-Weight Portfolio Cumulative Returns Over Time
This figure shows cumulative log returns from 1968 to 2021 of the long-short strategies based on GPT predictions. We form
equal-weight portfolios on June 30 of each year and hold them for one year. We long the top decile of stocks that are classified
as “increase” in earnings prediction, “large” or “moderate” in magnitude, based on their log probabilities. Similarly, we short
the top decile of stocks that are classified as “decrease” in earnings prediction, “large” or “moderate” in magnitude, based on
their log probabilities. Left panel shows the cumulative log returns of long and short portfolios. Right panel demonstrates
long-short returns and the market returns.
Table 1. Descriptive Statistics
This table shows descriptive statistics of the variables used in analyses. Panel A uses the entire universe of Compustat and
Panel B uses the intersection between I/B/E/S and Compustat. For Panel B, we require that each observation has at least three
analyst forecasts issued. Pred X denotes an indicator variable that equals one when method X predicts an increase in earnings and zero otherwise. Target is an indicator that equals one when earnings increase in the next period and zero otherwise. Size is the log of total assets, BtoM is the book-to-market ratio, Leverage is total debt over total assets, Earnings Volatility is the standard deviation of earnings over the past five years scaled by total assets, and PP&E is net property, plant, and equipment scaled by total assets.
Panel A. Full Sample (1968–2021)
N Mean Std P25 P50 P75
Target 150,678 0.555 0.497 0.000 1.000 1.000
Pred GPT 150,678 0.530 0.499 0.000 1.000 1.000
Pred Logit 150,678 0.591 0.493 0.000 1.000 1.000
Pred ANN 150,678 0.521 0.492 0.000 1.000 1.000
Pred Random 150,678 0.561 0.495 0.000 1.000 1.000
Size 133,830 6.135 2.298 4.441 6.074 7.717
BtoM 133,830 0.677 0.620 0.289 0.542 0.896
Leverage 133,830 0.549 0.267 0.348 0.553 0.740
Earnings Volatility 133,830 0.081 0.177 0.008 0.024 0.071
PP&E 133,830 0.281 0.270 0.042 0.195 0.454
Panel B. Analyst Sample (1983–2021)
N Mean Std P25 P50 P75
Target 39,533 0.563 0.496 0.000 1.000 1.000
Pred GPT 39,533 0.555 0.497 0.000 1.000 1.000
Pred Analyst1m 39,533 0.518 0.499 0.000 1.000 1.000
Pred Analyst3m 39,533 0.520 0.500 0.000 1.000 1.000
Pred Analyst6m 39,533 0.516 0.500 0.000 1.000 1.000
Pred Random 39,533 0.569 0.495 0.000 1.000 1.000
Size 37,736 7.541 1.897 6.138 7.504 8.813
BtoM 37,736 0.528 0.450 0.249 0.444 0.701
Leverage 37,736 0.580 0.256 0.400 0.582 0.771
Earnings Volatility 37,736 0.051 0.109 0.007 0.020 0.051
PP&E 37,736 0.257 0.260 0.038 0.155 0.418
Table 2. GPT vs. Human Analysts
This table reports the prediction performance of the random walk model and of analysts' forecasts issued one month (Analyst 1m), three months (Analyst 3m), and six months (Analyst 6m) after the previous earnings release. GPT (without CoT) denotes GPT's predictions without any chain-of-thought prompts. We simply provide the
model with structured and anonymous financial statement information. GPT (with CoT) denotes the model with financial
statement information and detailed chain-of-thought prompts. Accuracy is the percentage of correct predictions out of total
predictions. F1 is the harmonic mean of the precision and recall.
Accuracy F1
Random Walk 49.11% 53.02%
Analyst 1m 52.71% 54.48%
Analyst 3m 55.95% 55.33%
Analyst 6m 56.68% 56.85%
GPT (without CoT) 52.33% 54.52%
GPT (with CoT) 60.35% 60.90%
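The metric computation referenced in the caption, as a minimal sketch on toy 0/1 arrays:

from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # actual direction of the earnings change (toy data)
y_pred = [1, 0, 0, 1, 1, 1]  # a model's binary prediction (toy data)

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall
print(f"Accuracy {accuracy:.2%}, F1 {f1:.2%}")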
Table 3. Complementarities Between Human Analysts and GPT
*, **, and *** denote statistical significance at 10%, 5%, and 1% levels, respectively.
In Panel A, we investigate the determinants of incorrect predictions. The dependent variable, I(Incorrect = 1), is an indicator
that equals one when the model makes an incorrect prediction and zero otherwise. Independent variables are defined in Table 1.
All continuous variables are winsorized at the 1% and 99% levels. Standard errors are clustered at the industry level (the
regression design is sketched after the table). Column (1) uses GPT for I(Incorrect = 1) and columns (2), (3), and (4) use
analysts’ predictions. Panel B shows the incremental informativeness of each prediction. Both independent and dependent
variables are indicators. I(Increase = 1) is an indicator that equals one when actual earnings increase and zero otherwise. All
independent variables are indicators that equal one when the respective method predicts an increase in earnings and zero
otherwise. Standard errors are clustered at the industry level. In Panel C, we partition the sample based on analyst bias and
dispersion. Bias is the predicted portion of analysts’ forecast errors and dispersion is the standard deviation of analyst
forecasts scaled by the stock price at the end of the prior fiscal year. Low and High denote the first and fourth quartiles,
respectively. The F-test compares the magnitudes of the coefficients in columns (1) and (2), and in columns (3) and (4).
Panel A. Determinants
Dep Var I(Incorrect=1)
GPT Analyst 1m Analyst 3m Analyst 6m
(1) (2) (3) (4)
Size -0.017*** -0.008*** -0.010*** -0.010***
(-5.16) (-5.72) (-4.69) (-4.81)
BtoM -0.022 -0.016*** -0.012** -0.012**
(-0.99) (-2.94) (-2.21) (-2.35)
Leverage -0.145 -0.032 -0.029 -0.029
(-1.50) (-0.37) (-1.40) (-1.36)
Loss 0.193*** 0.141*** 0.146*** 0.145***
(4.76) (7.02) (6.90) (6.09)
Earnings Volatility 0.236*** 0.169*** 0.160*** 0.132**
(2.69) (4.08) (3.46) (2.47)
PP&E 0.133* 0.041 0.036* 0.031
(1.67) (1.18) (1.71) (1.25)
Year FE Yes Yes Yes Yes
Industry FE Yes Yes Yes Yes
Adjusted R2 0.08 0.027 0.032 0.029
N 37,736 37,736 37,736 37,736
Panel B. Incremental Informativeness
Dep Var I(Increase=1)
(1) (2) (3) (4) (5) (6) (7)
GPT 0.182*** 0.170*** 0.151** 0.152**
(2.99) (2.67) (2.35) (2.30)
Analyst 1m 0.073*** 0.110**
(3.11) (2.43)
Analyst 3m 0.098*** 0.122***
(4.02) (3.49)
Analyst 6m 0.100*** 0.124***
(4.05) (3.62)
Year FE Yes Yes Yes Yes Yes Yes Yes
Industry FE Yes Yes Yes Yes Yes Yes Yes
Adjusted R2 0.07 0.025 0.043 0.044 0.089 0.091 0.091
N 37,736 37,736 37,736 37,736 37,736 37,736 37,736
Panel C. Human Bias and Dispersion
Dep Var I(Increase=1)
Bias Dispersion
Low High Low High
(1) (2) (3) (4)
GPT 0.075** 0.341*** 0.118** 0.301***
(2.21) (4.39) (2.50) (3.20)
Analyst 1m 0.175*** 0.093*** 0.187*** 0.058**
(8.54) (3.05) (6.59) (2.35)
F-Test on GPT p-value <0.01 p-value <0.01
F-Test on Analyst p-value <0.01 p-value <0.01
Year FE Yes Yes Yes Yes
Industry FE Yes Yes Yes Yes
Adjusted R2 0.057 0.134 0.071 0.115
N 9,410 9,396 9,448 10,093
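The regression design referenced in Panel A, written out as a linear probability model with year and industry fixed effects and industry-clustered standard errors. This is a sketch on simulated data; the variable names and the toy panel are our own illustrative choices, not the paper’s code.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({  # toy firm-year panel standing in for the analyst sample
    "incorrect": rng.integers(0, 2, n),
    "size": rng.normal(7.5, 1.9, n),
    "btom": rng.normal(0.53, 0.45, n),
    "leverage": rng.uniform(0, 1, n),
    "loss": rng.integers(0, 2, n),
    "earn_vol": rng.uniform(0, 0.2, n),
    "ppe": rng.uniform(0, 1, n),
    "fyear": rng.integers(1983, 2022, n),
    "industry": rng.integers(1, 13, n),
})

fit = smf.ols(
    "incorrect ~ size + btom + leverage + loss + earn_vol + ppe"
    " + C(fyear) + C(industry)",  # year and industry fixed effects
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["industry"]})  # cluster by industry
print(fit.summary())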
Table 4. Comparison with ML Benchmarks
*, **, and *** denote statistical significance at 10%, 5%, and 1% levels, respectively.
In Panel A, we compare the prediction performance of GPT and quantitative models based on machine learning. Stepwise
Logistic follows Ou and Penman (1989)’s structure with their 59 financial predictors. ANN is a three-layer artificial neural
network model using the same set of variables as in Ou and Penman (1989). GPT (with CoT) provides the model with
financial statement information and detailed chain-of-thought prompts. Accuracy is the percentage of correct predictions out
of total predictions. F1 is the harmonic mean of precision and recall. In Panel B, we investigate the determinants of incorrect
predictions. The dependent variable, I(Incorrect = 1), is an indicator that equals one when the model makes an incorrect
prediction and zero otherwise. Independent variables are defined in Table 1. All continuous variables are winsorized at the 1%
and 99% levels. Standard errors are clustered at the industry level. Column (1) uses GPT for I(Incorrect = 1), column (2) uses
the ANN, and column (3) uses the stepwise logistic model. Panel C shows the incremental informativeness of each prediction.
Both independent and dependent variables are indicators. I(Increase = 1) is an indicator that equals one when actual earnings
increase and zero otherwise. All independent variables are indicators that equal one when the respective method predicts an
increase in earnings and zero otherwise. Standard errors are clustered at the industry level.
Panel A. Other Models
Accuracy F1
Stepwise Logistic 52.94% 57.23%
ANN (Ou and Penman (1989) variables) 60.45% 61.62%
ANN (Financial statement variables) 60.12% 61.30%
GPT (with CoT) 60.31% 63.45%
Panel B. Sources of Inaccuracy
Dep Var = I(Incorrect=1)
GPT ANN Stepwise Logistic
(1) (2) (3)
Size -0.015*** -0.024*** -0.029***
(-9.09) (-11.33) (-11.56)
BtoM 0.001 0.002 0.002
(0.38) (0.73) (0.69)
Leverage 0.092*** 0.085*** 0.090***
(6.30) (5.88) (6.02)
Loss 0.134*** 0.181*** 0.202***
(9.64) (11.35) (12.96)
Earnings Volatility 0.040** 0.062*** 0.078***
(2.09) (6.35) (8.02)
PP&E 0.027* 0.016 0.02
(1.95) (1.53) (1.69)
Year FE Yes Yes Yes
Industry FE Yes Yes Yes
Estimation OLS OLS OLS
Adjusted R2 0.097 0.102 0.109
N 133,830 133,830 133,830
Panel C. Incremental Informativeness
Dep Var I(Increase=1)
(1) (2) (3) (4) (5)
GPT 0.181*** 0.170*** 0.179***
(3.43) (2.67) (3.35)
ANN 0.150*** 0.053**
(3.69) (2.44)
Logistic 0.088*** 0.068**
(2.99) (2.05)
Year FE Yes Yes Yes Yes Yes
Industry FE Yes Yes Yes Yes Yes
Adjusted R2 0.056 0.051 0.032 0.061 0.06
N 133,830 133,830 133,830 133,830 133,830
Table 5. Experimental Variations and GPT’s Predictability
We compare the predictive performance of the model across several experimental settings. Conf Score is the confidence score
(ranging from 0 to 1) that the model produces; it measures how certain the model is in its answers. Log Prob is the average of
the token-level log probabilities. High and Low in columns (1), (2), (3), and (4) denote the first and fourth quartiles,
respectively (a sketch of the quartile split follows the table). Magnitude is the predicted magnitude of the earnings change
provided by the model. LLM Version denotes the family of LLM that we use for the experiment. Accuracy is the percentage of
correct predictions out of total predictions. F1 is the harmonic mean of precision and recall.
Conf Score Log Prob Magnitude LLM Version
High Low High Low Large Small GPT4.0 GPT3.5 Gemini
(1) (2) (3) (4) (5) (6) (7) (8) (9)
Accuracy 62.44% 59.86% 63.15% 58.54% 62.03% 60.22% 61.05% 52.29% 59.15%
F1 66.47% 55.62% 65.16% 54.15% 61.16% 57.95% 65.82% 59.17% 62.23%
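The quartile split referenced in the caption, sketched on simulated data; the column names (conf_score, pred, target) are assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({  # toy stand-in for the panel of GPT outputs
    "conf_score": rng.uniform(0, 1, 1000),
    "pred": rng.integers(0, 2, 1000),
    "target": rng.integers(0, 2, 1000),
})
df["conf_q"] = pd.qcut(df["conf_score"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
accuracy = (df["pred"] == df["target"]).groupby(df["conf_q"]).mean()
print(accuracy.loc[["Q4", "Q1"]])  # high- vs. low-confidence accuracy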
Table 6. Memory of GPT
In this table, we test GPT’s memory. For Panels A and B, we ask GPT to provide the ten most probable names of the
company and the most probable fiscal year, based on the standardized and anonymous financial statement information (an
illustrative query is sketched after the table). In Panel A, we do not provide chain-of-thought prompts, and in Panel B, we
provide the same chain-of-thought prompts as in the main analyses. In Panel C, we repeat the main analyses for fiscal year
2022 to predict 2023 earnings. GPT’s training window terminates in April 2023, so our sample period provides a clean
out-of-sample test. Accuracy is the percentage of correct predictions out of total predictions. F1 is the harmonic mean of
precision and recall.
Panel A. Without Chain-of-Thought
Accuracy F1 Score
Firm Name 0.07% 0.07%
Year 2.95% 0.41%
Panel B. Chain-of-Thought
Accuracy F1 Score
Firm Name 0.09% 0.09%
Year 1.01% 0.27%
Panel C. Out-of-Sample (Using 2022 data to predict fiscal year 2023)
Accuracy F1 Score
Logistic Regression 54.47% 60.60%
ANN 59.10% 61.13%
GPT 58.96% 63.91%
Analyst 53.06% 58.95%
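The query referenced in the caption, as we would reconstruct it; the prompt wording and the model name below are our own placeholders, not the paper’s actual prompt or configuration.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
anonymized_statements = "..."  # the standardized, anonymous balance sheet and income statement

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name; the paper's main experiments use GPT-4
    temperature=0,
    messages=[{
        "role": "user",
        "content": "From the standardized, anonymized financial statements below, "
                   "list the ten most probable company names and the most probable "
                   "fiscal year.\n\n" + anonymized_statements,
    }],
)
print(response.choices[0].message.content)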
Table 7. Predictive Ability of GPT-Generated Texts
We report the performance of an ANN model based on text embeddings. We use the BERT-base-uncased model to extract
contextualized embedding representations of the narrative financial statement analysis performed by the model. The input
layer has 768 dimensions, the two hidden layers have 256 and 64 dimensions, and the final layer has two dimensions (a
probability vector). We use the ReLU activation function for the first two transitions and a sigmoid for the last transition; a
minimal architecture sketch follows the table. Batch size is 128. We use the Adam optimizer and a cross-entropy loss. The
model is trained on rolling five-year training windows, and hyper-parameters (learning rate and dropout) are determined by a
grid search on a random 20% of the training sample. ANN with Financial Statement Variables denotes the model in Table 4,
Panel A. ANN with Text and FS Variables denotes a model that allows full non-linear interactions among the embedding
neurons and the FS variables. ANN with Adjusted Text Embedding denotes models with adjusted text inputs. GPT produces
three main textual outputs: trend, ratio, and rationale. ANN excl. Trend denotes an ANN whose input embedding uses only
the ratio and rationale analyses; ANN excl. Ratio and ANN excl. Rationale are defined likewise. AUC denotes the area under
the curve.
Accuracy F1 Score AUC
ANN with GPT Text Embedding 58.95% 65.26% 64.22%
ANN with Financial Statement Variables 60.12% 61.30% 59.13%
ANN with Text and FS Variables 63.16% 66.33% 65.90%
ANN with Adjusted Text Embedding
ANN excl. Trend 57.11% 64.03% 63.81%
ANN excl. Ratio 55.65% 62.36% 61.89%
ANN excl. Rationale 58.88% 65.15% 64.16%
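A minimal PyTorch sketch of the embedding classifier described in the caption (768 to 256 to 64 to 2, ReLU between hidden layers, a sigmoid on the output, Adam, cross-entropy); the learning rate and the random batch are placeholders, since the paper tunes these by grid search.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(),  # BERT-base embedding -> first hidden layer
    nn.Linear(256, 64), nn.ReLU(),   # second hidden layer
    nn.Linear(64, 2), nn.Sigmoid(),  # two-dimensional output, per the caption
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder learning rate

x = torch.randn(128, 768)        # one batch of text embeddings (batch size 128)
y = torch.randint(0, 2, (128,))  # 1 = earnings increase next period
optimizer.zero_grad()
loss = criterion(model(x), y)    # cross-entropy on the sigmoid outputs, as described
loss.backward()
optimizer.step()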
Table 8. Asset Pricing Implications
*, **, and *** denote statistical significance at 10%, 5%, and 1% levels, respectively.
In this table, we show the asset pricing implications of GPT’s predictions. We form portfolios on June 30 of each year and hold
them for one year. To form portfolios based on GPT’s predictions, for each fiscal year we choose stocks with a binary prediction
of “increase” and a magnitude prediction of either “moderate” or “large”. We then sort those stocks on descending average log
probability values and long the highest-ranked stocks, equal in number to 10% of all stocks available in the given fiscal year.
We do the same for stocks with a binary prediction of “decrease”: we keep stocks with a predicted magnitude of either
“moderate” or “large” and sort them on log probability values. For the ANN and the logit model, we sort stocks on the
predicted probability of an earnings increase. Then, on June 30, we long stocks in the top decile and short stocks in the bottom
decile (a sketch of this sort follows the table). Panel A reports monthly Sharpe ratios. Panel B reports alphas based on the
CAPM, three-factor, four-factor, five-factor, and six-factor (five-factor plus momentum) models.
Panel A. Sharpe Ratios (monthly)
Equal-Weighted Value-Weighted
(1) (2) (3) (4) (5) (6)
High Low H-L High Low H-L
GPT Predictions
Ret 1.72 0.44 1.28 1.04 0.48 0.56
Std 0.59 0.68 0.38 0.52 0.69 0.38
Sharpe 2.92 0.65 3.36 2.00 0.70 1.47
ANN
Ret 1.40 0.51 0.89 1.11 0.59 0.52
Std 0.72 0.67 0.35 0.61 0.88 0.29
Sharpe 1.94 0.76 2.54 1.82 0.67 1.79
Logit
Ret 1.38 0.50 0.88 1.04 0.62 0.42
Std 0.61 0.65 0.43 0.55 0.77 0.52
Sharpe 2.26 0.77 2.05 1.89 0.81 0.81
Panel B. Alphas (monthly)
Equal-Weighted Value-Weighted
CAPM 3 Factor 4 Factor 5 Factor 5F+Mom CAPM 3 Factor 4 Factor 5 Factor 5F+Mom
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
GPT
High 1.03*** 1.04*** 1.05*** 1.02*** 1.03*** 0.48*** 0.58*** 0.61*** 0.52*** 0.55***
(10.47) (16.28) (16.21) (15.67) (14.78) (5.93) (7.82) (8.08) (6.78) (7.35)
Low -0.20 -0.29** -0.24** 0.05 0.19* -0.23 -0.42*** -0.28* -0.04 0.18
(-1.24) (-2.46) (-2.03) (0.40) (1.65) (-1.45) (-2.91) (-1.96) (-0.27) (1.34)
H-L 1.23*** 1.33*** 1.29*** 0.97*** 0.84*** 0.71*** 1.00*** 0.89*** 0.56*** 0.37**
(8.96) (10.48) (10.10) (3.14) (4.48) (3.81) (6.27) (5.57) (3.78) (2.43)
ANN
High 0.98*** 1.00*** 0.99*** 0.85*** 0.82*** 0.55*** 0.59*** 0.65*** 0.56*** 0.52***
(8.35) (14.23) (11.32) (10.06) (8.55) (6.30) (8.16) (9.34) (6.95) (6.88)
Low -0.13 -0.23*** 0.02 0.16 0.22 -0.35* -0.49*** -0.23 -0.16 0.02
(-1.16) (-2.99) (0.32) (0.62) (1.53) (-1.73) (-3.19) (-1.16) (-0.89) (0.66)
H-L 1.11*** 1.23*** 0.97*** 0.69** 0.60* 0.90*** 1.08*** 0.88*** 0.72*** 0.50***
(7.62) (11.32) (9.38) (2.15) (1.89) (4.23) (7.99) (6.00) (4.56) (3.19)
Logit
High 0.89*** 0.90*** 0.86*** 0.71*** 0.68*** 0.40*** 0.44*** 0.46*** 0.36*** 0.33**
(7.15) (9.11) (9.25) (6.11) (4.23) (4.15) (4.22) (4.44) (2.86) (2.15)
Low -0.18 -0.26* -0.05 0.23 0.25 -0.24 -0.36** -0.29* -0.11 0.02
(-1.23) (-1.95) (-0.26) (1.06) (1.10) (-1.55) (-2.26) (-1.77) (-0.95) (0.05)
H-L 1.07*** 1.16*** 0.91*** 0.48* 0.43* 0.64** 0.80** 0.75** 0.47 0.31
(6.50) (8.15) (7.19) (2.06) (1.96) (2.35) (2.56) (2.41) (1.67) (1.55)
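The portfolio sort referenced in the caption, under our reading of it; the column names (direction, magnitude, avg_log_prob), the toy cross-section, and the return series are illustrative assumptions, not the paper’s code or data.

import numpy as np
import pandas as pd

def gpt_leg(year_df: pd.DataFrame, side: str) -> pd.DataFrame:
    # Keep stocks predicted `side` with a "moderate" or "large" magnitude, then
    # take the 10% of all available stocks with the highest average log probability.
    n = int(0.10 * len(year_df))
    pool = year_df[(year_df["direction"] == side)
                   & year_df["magnitude"].isin(["moderate", "large"])]
    return pool.nlargest(n, "avg_log_prob")

rng = np.random.default_rng(0)
year_df = pd.DataFrame({  # toy cross-section for one fiscal year
    "direction": rng.choice(["increase", "decrease"], 500),
    "magnitude": rng.choice(["small", "moderate", "large"], 500),
    "avg_log_prob": rng.uniform(-2.0, 0.0, 500),
})
longs = gpt_leg(year_df, "increase")   # held from June 30 for one year
shorts = gpt_leg(year_df, "decrease")

# Panel A's monthly Sharpe ratio of the long-short leg, on placeholder returns
# loosely calibrated to the equal-weighted H-L numbers reported above
ls = pd.Series(rng.normal(0.0128, 0.0038, 636))  # 636 months, 1968-2021
print(ls.mean() / ls.std())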