Starting today, we’re releasing new tools for multimodal financial analysis within Amazon SageMaker JumpStart. SageMaker JumpStart helps you quickly and easily get started with machine learning (ML) and provides a set of solutions for the most common use cases that can be trained and deployed readily with just a few clicks. You can now access a collection of multimodal financial text analysis tools, including example notebooks, text models, and solutions.
With these new tools, you can enhance your tabular ML workflows with new insights from financial text documents and potentially help save up to weeks of development time. With the new SageMaker JumpStart Industry SDK, you can easily retrieve common public financial documents, including SEC filings, and further process financial text documents with features such as summarization and scoring of the text for various attributes, such as sentiment, litigiousness, risk, and readability. In addition, you can access pre-trained language models trained on financial text for transfer learning, and use example notebooks for data retrieval, text feature engineering, and multimodal classification and regression models.
In this post, we show how to curate a dataset of SEC filings and financial variables, use natural language processing (NLP) for feature engineering on the dataset, and undertake multimodal ML to build a better ratings classifier.
The new financial analysis features include an example notebook that demonstrates APIs to retrieve parsed SEC filings, APIs for summarizers, and APIs to score text for various attributes (see SageMaker JumpStart SEC Filings Retrieval w/Summarizer and Scoring). A second notebook (Multi-category ML on SEC Filings Data) demonstrates multicategory classification on SEC filings. A third notebook (ML on a TabText (Multimodal) Dataset) shows how to undertake ML on multimodal financial data using the Paycheck Protection Program (PPP) as an example. Four additional text models (RoBERTa-SEC-Base, RoBERTa-SEC-WIKI-Base, RoBERTa-SEC-Large, and RoBERTa-SEC-WIKI-Large) are provided to generate embeddings for transfer learning using pre-trained financial models that have been trained on Wiki text and 10 years of SEC filings.
Finally, a SageMaker JumpStart solution (Corporate Credit Rating Prediction) demonstrates how to use the pipeline of SEC filings (long-form text data) and financial ratios (tabular data) to build corporate credit rating prediction models. This is the model discussed in this post, which is the first in a series of posts that describe these new financial analysis ML tools. In this post, we explain how you can use this solution for credit scoring, which is fully customizable so you can accelerate your ML journey.
Credit assessment using ML: SageMaker JumpStart solution
We’re all familiar with individual credit scoring, especially our own credit scores, from FICO. In this notebook, we revisit the oldest and one of the most widely used models for corporate credit scoring, the Altman Z-score. The Altman model generates a credit score, where higher scores denote higher credit quality and lower scores denote lower quality firms.
Altman developed his model in 1968, using just 66 firms’ data to fit an accurate bankruptcy prediction model. It predicted which firms would default within 1 year. Altman fit this model using Linear Discriminant Analysis (LDA), arguably the first instance of the use of an ML algorithm in academic finance. This seminal paper has generated a family of Altman Z-score models that are used all over the globe. The model only requires a few inputs from a company’s financials and therefore may be applied to public and private firms, small and large. It’s in widespread use today. It uses tabular data.
In this post, you learn how to use a credit scoring model such as Altman’s Z-score, and enhance the model with financial text from SEC filings. The entire model is presented in the SageMaker JumpStart solution model card titled Corporate Credit Rating Prediction.
The preceding model card appears in SageMaker JumpStart. You can access this model card through SageMaker Studio.
Navigate to that card and deploy the model by choosing Launch.
The following page appears.
You can see a model that is deployed for inference and an endpoint. Wait until they’re ready and show the status Complete. Choose Open Notebook to open the first notebook, which is for training and endpoint deployment. You can work through this notebook to learn how to use this solution and then modify it for any other application you may want on your own data. The solution comes with synthetic data and uses a subset of it to exemplify the steps needed to train the model, deploy it to an endpoint, and invoke the endpoint for inference. The notebook also contains code to deploy an endpoint of your own.
To open the second notebook, choose Use Endpoint in Notebook. This opens the inference notebook to use the already deployed example endpoint. In the inference notebook, you can see how to prepare the data to invoke the example endpoint to do inference on a batch of examples. The endpoint returns predicted ratings, as shown in the following screenshot, in the last code block of the inference notebook.
You can use this solution as a template for a text-enhanced credit rating model. It shows how to take a model based on numeric features (in this case Altman’s famous five variables) and combine it with SEC filings text to achieve a material improvement in the prediction of credit ratings. You’re not restricted to the Altman variables and can add more variables as needed, or change out the variables completely. The main objective in this notebook is to show how to enhance Altman’s Z-score model with text so you can use ML techniques to achieve a best-in-class model.
The Altman model is widely used by a range of users and is therefore taught as part of required coursework by the Corporate Finance Institute (CFI). Altman himself offered a 50-year retrospective on the model in parts 1, 2, and 3, discussing its wide use and misuse. To learn more, watch him on video and read this article. For a critique and improvement on the model, see the following article by Seeking Alpha, a well-known investor community. The Z-score Plus model is even available as an app on mobile devices.
Therefore, think of this workflow as a well-established starting point for the use of ML for credit scoring.
How to use this solution
To begin, run the notebooks on the included example data to see how simple this solution is to use. You can then modify the notebooks for your own model. The modification includes the following steps:
1. Bring in your tabular data, with one row for each firm’s financial data. This may be the same as that in the notebook (Altman’s variables) or any others you work with for credit modeling. There is no restriction on the number of variables.
2. Bring in your text data. The example in this post uses the SEC 10-K/Q filings, specifically the Management Discussion and Analysis section of the filings. If you want to download the latest filings and use them, see the SageMaker JumpStart solution for doing this in a single API call on SageMaker, titled SEC Filings Retrieval w/Summarizer and Scoring. This not only downloads the text you want, but also allows you to enhance the data with NLP scores, summaries, and more as additional columns in the DataFrame so that you can use several features of the text, such as readability, positivity, risk, litigiousness, and sentiment.
3. Join the data from Steps 1 and 2.
4. Reuse the solution notebooks with your new dataset with minimal changes required. The training notebook shows how to use AWS MXNet with its AutoGluon package for multimodal ML in a few lines of code and deploy an endpoint. The inference notebook shows how to call the endpoint to get predictions.
That’s it! The solution is self-contained and works with a few clicks.
Important: This solution is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The associated notebooks, including the trained model, use synthetic data, and are not intended for production.
Deep dive with examples
You may want to explore the solution further. In the appendix, we offer more detail on credit scoring and some additional simple code to show how to add SEC text to standard tabular features to undertake multimodal ML. All these functionalities are made simple using APIs in SageMaker JumpStart models. We cover the following:
A review of Altman’s Z-score, a popular credit scoring model that may be used on private and public entities.
A discussion of the multimodal dataset.
How to retrieve SEC filings and combine them with tabular data, denoted as TabText. We show how to use the API in the SageMaker JumpStart example notebook titled SEC Filings Retrieval w/Summarizer and Scoring.
How to read in the data.
How to train and test your model on tabular data only and TabText. You can observe the improvement in the model as we expand the feature set with text.
How to add NLP scores to enhance the feature set. The API for this is also demonstrated in the SageMaker JumpStart example notebook titled SEC Filings Retrieval w/Summarizer and Scoring.
Summary
We have seen how to enhance tabular ML models for credit scoring with long-form financial text. You can adapt the training notebook and the inference notebook in the JumpStart solution Corporate Credit Rating Prediction with your own data and labels as follows:
Bring a dataset of tabular features for each ticker.
For each ticker use JumpStart’s SEC retrieval engine to download the required SEC filings (for example, download the most recent 10-K or 10-Q). Then join the text DataFrame with the tabular DataFrame, and further enhance the DataFrame with engineered features using NLP scoring.
Add in labels. For credit, these could be any of the following:
Ratings
Changes in ratings
Z-score
Probability of default
Defaulted or not
Credit spreads
Use the preceding AutoML code to train a classification or regression model.
Deploy the model to an endpoint. You can then call this endpoint as needed for new data.
SEC filings aren’t the only text that you can use. You can use any text that contains information about the label. For example, the text of internal rating analyses may be even better than SEC filings.
To get started, you can find the Corporate Credit Rating Prediction solution in SageMaker JumpStart in SageMaker Studio. For more information, see SageMaker JumpStart.
Legal Disclaimers: This post is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. This post uses data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions.
Thanks to several team members for support with this work: Miyoung Choi, Vinay Hanumaiah, Cuong Nguyen, Xavier Ragot, Derrick Zhang, Li Zhang, Yue Zhao, Daniel Zhu
Appendix
In this appendix, we discuss topics related to this solution.
What is Altman’s Z-score?
The model is based on a well-known bankruptcy prediction approach, from the original paper by Ed Altman (1968). For a brief summary, see Measuring the ‘fiscal-fitness’ of a company: The Altman Z-Score.
The original seminal paper by Altman is available at: Altman, Edward. (September 1968). “Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy”. Journal of Finance, v23(4): 589–609. [doi:10.1111/j.1540-6261.1968.tb00843.x]
The model uses eight inputs from a company’s financials:
Current Assets (CA)
Current Liabilities (CL)
Total Liabilities (TL)
EBIT (Earnings Before Interest and Taxes)
Total Assets (TA)
Net Sales (NS)
Retained Earnings (RE)
Market Value of Equity (MVE)
These eight inputs translate into the following five financial ratios:
A: EBIT / Total Assets
B: Net Sales / Total Assets
C: Market Value of Equity / Total Liabilities
D: Working Capital / Total Assets
E: Retained Earnings / Total Assets
These ratios are used to fit binary class data of companies that go bankrupt and those that do not. Altman fitted the model using Linear Discriminant Analysis, possibly the earliest use of ML in finance. The linear discriminant function is as follows:
Zscore = 3.3A+0.99B+0.6C+1.2D+1.4E
These translate into suggested company credit quality ranges, which may vary by use, such as in the following example:
𝑍 > 3.0 : safe
2.7 < 𝑍 < 2.99 : caution
1.8 < 𝑍 < 2.7 : bankruptcy possible in 2 years
𝑍 < 1.8 : high chance of bankruptcy
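The conversion from the eight inputs to the five ratios and the Z-score can be sketched in a few lines of Python. This is a minimal illustration, not the solution notebook's code: the `Financials` container and the function names are hypothetical, working capital is taken as current assets minus current liabilities, and the assignment of scores that fall exactly on a boundary is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Financials:
    """The eight financial inputs (hypothetical container for illustration)."""
    current_assets: float
    current_liabilities: float
    total_liabilities: float
    ebit: float
    total_assets: float
    net_sales: float
    retained_earnings: float
    market_value_equity: float


def altman_ratios(f: Financials) -> dict:
    """Convert the eight inputs into the five ratios A to E."""
    working_capital = f.current_assets - f.current_liabilities
    return {
        "A": f.ebit / f.total_assets,
        "B": f.net_sales / f.total_assets,
        "C": f.market_value_equity / f.total_liabilities,
        "D": working_capital / f.total_assets,
        "E": f.retained_earnings / f.total_assets,
    }


def z_score(r: dict) -> float:
    """Linear discriminant function with the coefficients given above."""
    return 3.3 * r["A"] + 0.99 * r["B"] + 0.6 * r["C"] + 1.2 * r["D"] + 1.4 * r["E"]


def zone(z: float) -> str:
    """Map a Z-score to the example credit quality ranges (boundary handling assumed)."""
    if z >= 3.0:
        return "safe"
    if z >= 2.7:
        return "caution"
    if z >= 1.8:
        return "bankruptcy possible in 2 years"
    return "high chance of bankruptcy"


# Toy firm with total assets normalized to 100 (made-up numbers).
firm = Financials(
    current_assets=40.0, current_liabilities=20.0, total_liabilities=50.0,
    ebit=10.0, total_assets=100.0, net_sales=120.0,
    retained_earnings=30.0, market_value_equity=80.0,
)
ratios = altman_ratios(firm)
print(round(z_score(ratios), 3), zone(z_score(ratios)))
```

Because every ratio except C is scaled by total assets, normalizing total assets to 100 leaves the score unchanged.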
We enhance the Altman five-feature set (A,B,C,D,E stated above) with text from SEC filings to get an improved Z-score model.
Multimodal data
We created a synthetic dataset that combined randomly chosen SEC filings with simulated financial data. Briefly, we created the synthetic dataset using the following steps (ticker names have not been included, so as to not cause confusion with real tickers):
We extracted the Management Discussion and Analysis (MDNA) section of the 10-K/Q filings for a sample of 3,286 firm filings. We added a column to this data for the industry code, because it may also be a useful feature given that firms within an industry may all be impacted by the same factors. See the following section for the use of SageMaker JumpStart to retrieve SEC filings and construct a DataFrame.
We then scored the text for five positive attributes (positivity, sentiment, polarity, safety, certainty) and five negative attributes (negativity, litigiousness, fraud, risk, uncertainty). We added the positive scores and subtracted the negative scores to get a net rank score. We correlated firms with high rank scores with high ratings and firms with low rank scores with low ratings. This ensures that the text is correlated with the ratings label.
Then, we used official government sites to get a US balance sheet, income statement, and market statistics as anchors to simulate financials for all 3,286 firms. The financials are normalized so that the total assets for each firm are 100. The eight financial values are checked for consistency (for example, the simulated current liabilities can’t exceed the simulated total liabilities). Such cases are discarded and the financials are regenerated. The anchor statistics are taken from the following:
Balance sheet data
Income statement data
Price to book data, which is used to convert book value of equity into market value of equity (MVE).
Altman’s Z-score averages for the US
The eight financial values are converted into the five ratios needed by the Altman Z-score model.
Z-scores are computed for each firm.
The financial values for companies with high (low) Z-scores are concatenated to the text of companies with high (low) rank scores. Now we have a consolidated multimodal (text, numerical, categorical) dataset.
The high (low) rank companies are assigned high (low) ratings, and the number and mean Z-score of companies are adapted to calibrate with the table from Altman’s slides (slide 9). Z-scores and rank scores are then discarded.
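Two of the steps above, computing the net rank score and matching high (low) rank text with high (low) Z-score financials, can be illustrated as follows. This is a sketch under assumptions: the post does not describe the exact pairing procedure, so sorting both sides and pairing them position by position is one plausible reading, and all names (`net_rank_score`, `pair_by_rank`, `make_scores`) and values are hypothetical.

```python
POSITIVE = ["positivity", "sentiment", "polarity", "safety", "certainty"]
NEGATIVE = ["negativity", "litigiousness", "fraud", "risk", "uncertainty"]


def net_rank_score(scores: dict) -> float:
    """Sum of the positive attribute scores minus the sum of the negative ones."""
    return sum(scores[k] for k in POSITIVE) - sum(scores[k] for k in NEGATIVE)


def pair_by_rank(text_rows: list, fin_rows: list) -> list:
    """Pair the i-th highest rank-score text with the i-th highest Z-score financials.

    Assumed pairing rule for illustration only.
    """
    texts = sorted(text_rows, key=lambda r: r["rank_score"], reverse=True)
    fins = sorted(fin_rows, key=lambda r: r["z_score"], reverse=True)
    return [{**t, **f} for t, f in zip(texts, fins)]


def make_scores(pos: float, neg: float) -> dict:
    """Toy helper: every positive attribute gets `pos`, every negative one gets `neg`."""
    return {**{k: pos for k in POSITIVE}, **{k: neg for k in NEGATIVE}}


# Made-up text scores and simulated financials.
texts = [
    {"mdna": "doc_a", "rank_score": net_rank_score(make_scores(0.1, 0.3))},
    {"mdna": "doc_b", "rank_score": net_rank_score(make_scores(0.4, 0.1))},
]
financials = [
    {"firm": "f1", "z_score": 3.4},
    {"firm": "f2", "z_score": 1.2},
]
merged = pair_by_rank(texts, financials)
for row in merged:
    print(row["mdna"], row["firm"], round(row["rank_score"], 2), row["z_score"])
```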
The final dataset (stored as CCR_data.csv) comprises the MD&A text, industry code, and eight financial variables. The last column contains the rating, namely, the label for classification. The data contains seven categories of labels: AAA, AA, A, BBB, BB, B, CCC. These labels are not reflective of companies’ actual credit ratings because they are based on synthetically generated data. This synthetic dataset is automatically downloaded when you run the training notebook of the JumpStart solution Corporate Credit Rating Prediction described earlier.
Curate TabText
The following code is a template for constructing a text-enhanced credit rating model. It shows how to take a model based on numeric features (in this case Altman’s five variables) and combine it with SEC filings text to achieve a material improvement in the prediction of credit ratings. In this example, we observe an 8 percentage point increase in accuracy (on our example test data) when text is added.
SEC filings are retrieved from the SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) website, which provides open data access (note the disclaimer in this post). EDGAR is the primary system under the US Securities and Exchange Commission (SEC) for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. EDGAR contains millions of company and individual filings. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average. In the following code, we provide a simple, single API call that creates a dataset in a few lines of code, for any period of time and for a large number of tickers.
The API contains three parts:
The first section specifies the following:
The tickers or SEC CIK codes for the companies whose forms are being extracted
The SEC form types (in this case 10-K, 10-Q)
Date range of forms by filing date
The output CSV file and Amazon Simple Storage Service (Amazon S3) bucket to store the dataset
The second section shows how to assign system resources and has default values in place
The final section runs the API
This kicks off the processing job running in a SageMaker container and makes sure that even a very large retrieval can run without the notebook connection.
The data is stored in a file denoted dataset_10k_10q.csv as shown in the preceding code. The file may be examined as follows:
The mdna column of text from this DataFrame is then combined with financial data to create a composite dataset, stored in a file titled CCR_data.csv, which is read in next. We denoted the composite of tabular and text data as TabText.
Read in the TabText dataset
We read in this dataset and examine its properties. It has 11 columns: one text column, one categorical column, eight numerical columns, and a label column (Rating). Although the values in this dataset match the broad averages in the economy and we trained a model on it, in practice you should train the model on your own real data.
Next, we convert the financial values into Altman’s five ratios, resulting in the final DataFrame we use for multimodal ML:
The dataset has eight columns: one text column, one categorical column, five numerical columns, and a label column. These are the text of the MD&A section, the industry code, the five ratios (A, B, C, D, E) developed by Altman as described earlier, and the label column, Rating.
As a cross-check, we compute the Z-score for each firm and examine the mean score by rating. The scores decline as the rating of firms drops. This confirms that the dataset captures the relationship between Z-scores and ratings. (Of course, we don’t use Z-score as a feature.)
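This cross-check can be sketched without any ML library. The example below assumes rows stored as dictionaries with the column names A to E and Rating; the toy values and the helper names `mean_z_by_rating` and `is_monotone_decreasing` are hypothetical and are not taken from CCR_data.csv.

```python
from collections import defaultdict
from statistics import mean

# Ratings from highest to lowest credit quality.
RATING_ORDER = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC"]


def z_score(r: dict) -> float:
    """Altman Z-score from the five ratios A to E."""
    return 3.3 * r["A"] + 0.99 * r["B"] + 0.6 * r["C"] + 1.2 * r["D"] + 1.4 * r["E"]


def mean_z_by_rating(rows: list) -> dict:
    """Group firms by rating and return the mean Z-score per rating, best rating first."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["Rating"]].append(z_score(r))
    return {rating: mean(groups[rating]) for rating in RATING_ORDER if rating in groups}


def is_monotone_decreasing(values: list) -> bool:
    """True if each value is no larger than the one before it."""
    return all(a >= b for a, b in zip(values, values[1:]))


# Toy rows with made-up ratios.
rows = [
    {"Rating": "AAA", "A": 0.20, "B": 1.5, "C": 3.0, "D": 0.4, "E": 0.5},
    {"Rating": "BBB", "A": 0.08, "B": 1.0, "C": 1.0, "D": 0.2, "E": 0.2},
    {"Rating": "CCC", "A": -0.05, "B": 0.6, "C": 0.3, "D": -0.1, "E": -0.2},
]
means = mean_z_by_rating(rows)
print(means)
print(is_monotone_decreasing(list(means.values())))
```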
Train and test ML models
Our dataset is multimodal and contains the following:
A column of long text, with documents exceeding the maximum number of words allowed by transformer models such as BERT
A categorical column, industry code
Five numerical columns containing the features used by the Altman Z-score model
A label for the rating of the company
We use the GluonNLP library based on the MXNet framework. Install the required packages. You can update the following example code with newer releases of mxnet. For newer releases of autogluon, see GluonNLP: NLP made easy.
Use only tabular data
First, we mimic the original version of the Altman model with just five financial ratios and industry code—this is just tabular data. We later fit an extended model with text and tabular data.
To start, we also choose a binary classification problem, where 1 = {AAA,AA,A,BBB} (namely, investment grade firms) and 0 = {BB,B,CCC} (below investment grade firms). We drop the text (MDNA) column from the dataset. In the solution itself, you will see a multi-category classification task, which we briefly highlight towards the end of the post.
Prepare the binary label based on rating:
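A minimal sketch of this labeling step, assuming ratings are plain strings drawn from the seven categories in the dataset. The function name `binary_label` is hypothetical, and the solution notebook's own code may differ.

```python
INVESTMENT_GRADE = {"AAA", "AA", "A", "BBB"}
BELOW_INVESTMENT_GRADE = {"BB", "B", "CCC"}


def binary_label(rating: str) -> int:
    """Return 1 for investment grade ratings and 0 for below investment grade."""
    if rating in INVESTMENT_GRADE:
        return 1
    if rating in BELOW_INVESTMENT_GRADE:
        return 0
    raise ValueError(f"Unknown rating: {rating!r}")


ratings = ["AAA", "BB", "BBB", "CCC"]
labels = [binary_label(r) for r in ratings]
print(labels)
```

Raising an error on an unknown rating, rather than silently mapping it to 0, keeps mislabeled rows from leaking into the training data.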
Implement an 80/20 train/test split on the data:
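One way to sketch such a split with only the standard library is shown below. The function name, the fixed seed, and the choice of a simple random (non-stratified) split are assumptions for illustration.

```python
import random


def train_test_split(rows: list, test_fraction: float = 0.2, seed: int = 42):
    """Shuffle a copy of the rows and hold out `test_fraction` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 firm records
train, test = train_test_split(rows)
print(len(train), len(test))
```

With imbalanced rating classes, a stratified split that preserves the class proportions in both parts may be preferable.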
AutoML
We use the parsimonious framework from AutoGluon. This library accepts DataFrames containing text, tabular, and CV data, and fits models automatically using a set of well-known classifiers, such as 𝐾-nearest neighbors, Gradient Boosted models, Random Forest models, Boosted models, Extra Trees models, XGBoost, and Neural Net models. These models are then stack-ensembled to get the best weighted model. You can also perform hyperparameter tuning. For full details, see AutoGluon: AutoML for Text, Image, and Tabular Data. Complete the following steps:
Instantiate the AutoGluon model.
Indicate where the trained model will be stored.
Fit the model on the DataFrame. The fit requires just a single line of code.
For full reference, see AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.
Next, assess metrics to determine the best-performing model on the test data:
Note that balanced accuracy is average recall on both classes. MCC is the Matthews Correlation Coefficient.
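To make these two metrics concrete, the following sketch computes them from the counts of a binary confusion matrix. The function names and the toy labels are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here, not something the post specifies.

```python
import math


def confusion_counts(y_true: list, y_pred: list):
    """Return (tp, tn, fp, fn) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Average of the recall on the positive class and the recall on the negative class."""
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
    recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
    return (recall_pos + recall_neg) / 2


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, balanced_accuracy(*counts), round(mcc(*counts), 4))
```

Unlike plain accuracy, both metrics stay informative when one class is much larger than the other.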
We can also see the leaderboard generated from the preceding code and presented in order of validation score.
Use multimodal data
We then combine the text and tabular data to get a final model to showcase multimodal ML. The steps remain exactly the same as before. You don’t need to perform vectorization of the text or one-hot encoding of the categorical variable. All this is handled by MXNet/AutoGluon. Even the label is auto-detected, so the class of problem doesn’t need to be specified.
Because the text in these sections is very long (thousands of words), we can’t use transformers, because they can handle only a restricted number of words (usually fewer than 1,000). Therefore, AutoGluon uses TF-IDF with n-grams to transform the text into numerical vectors and then applies ML to the text and tabular data.
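To illustrate the idea of TF-IDF over n-grams, here is a small standard-library sketch. It is not AutoGluon's implementation: the tokenizer, the use of unigrams and bigrams, the length-normalized term frequency, and the smoothed IDF formula are all assumptions chosen as one common variant.

```python
import math
import re
from collections import Counter


def ngrams(text: str, n_max: int = 2) -> list:
    """Lowercase the text, split it into words, and return all 1..n_max-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf(docs: list, n_max: int = 2):
    """Return (vocabulary, one TF-IDF vector per document)."""
    counts = [Counter(ngrams(d, n_max)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = {term: sum(1 for c in counts if term in c) for term in vocab}
    # Smoothed inverse document frequency (assumed variant).
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1.0 for term in vocab}
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for empty documents
        vectors.append([c[term] / total * idf[term] for term in vocab])
    return vocab, vectors


docs = ["Revenue increased", "Revenue decreased sharply"]
vocab, vectors = tfidf(docs)
print(vocab)
print([round(x, 3) for x in vectors[0]])
```

Each document becomes a fixed-length numerical vector regardless of its length, which is why this representation can handle MD&A sections that run to thousands of words.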
We fit a model with very few lines of code. This time, we don’t drop the text column containing the MD&A:
Accuracy on the test dataset has increased to 93% (on TabText) versus 85% (on the tabular dataset).
We also see below the leaderboard generated from the preceding code and presented in order of validation score.
Further enhance the feature set with NLP scoring
SageMaker JumpStart has its own SDK with an API to further enhance the feature set with numerical values that score the text (in column MDNA) in the dataset for its various attributes. To see how to use this API, refer to the JumpStart example notebook SEC Filings Retrieval w/Summarizer and Scoring. This adds columns with additional values based on the percentage of words in the text that match separate word lists for each attribute or, for attributes such as sentiment and readability, based on an algorithm. You have 11 attributes: negative, certainty, risk, uncertainty, safe, fraud, litigious, positive, polarity, sentiment, and readability.
We use the Gunning fog index to calculate the readability score. Sentiment analysis uses VADER. Polarity calculation uses positive and negative word lists. The other NLP scores deliver the similarity (word frequency) with the default word lists (positive, negative, litigious, risk, fraud, safe, certainty, and uncertainty) provided through the smjsindustry library. You can also provide your own word list to calculate the NLP score of your own scoring types.
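The word-list scores can be understood through a simple sketch that computes the proportion of words in a document that appear in a given list. This is an illustration only and does not call the smjsindustry library; the function name, the tokenizer, and the word list are hypothetical.

```python
import re


def word_list_score(text: str, word_list: list) -> float:
    """Proportion of words in `text` that occur in `word_list` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    vocab = {w.lower() for w in word_list}
    return sum(1 for t in tokens if t in vocab) / len(tokens)


# Hypothetical word list for a "risk" attribute.
RISK_WORDS = ["risk", "uncertain", "volatility", "exposure"]

text = "Market volatility and credit risk increased our exposure."
print(word_list_score(text, RISK_WORDS))
```

The same function works with a custom list, for example a set of ESG terms, to score an additional concept.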
These numerical scores are added as new columns to the text DataFrame. This creates a multimodal DataFrame that is a mixture of tabular data and long-form text, called TabText. When you submit this DataFrame for ML, it’s a good idea to normalize the columns of NLP scores (usually with standard normalization or min-max scaling).
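The two normalization options mentioned above can be sketched as follows. The function names are hypothetical, the population standard deviation is an assumed choice, and mapping a constant column to zeros is a convention adopted here.

```python
from statistics import mean, pstdev


def standardize(values: list) -> list:
    """Standard normalization: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]


def min_max_scale(values: list) -> list:
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


risk_scores = [2.0, 4.0, 6.0]  # made-up NLP scores for three filings
print(min_max_scale(risk_scores))
print([round(z, 4) for z in standardize(risk_scores)])
```

In a train/test workflow, the mean, standard deviation, minimum, and maximum should come from the training data only and then be applied to the test data.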
These scoring metrics are simple and report the proportion of words in a document that occur in a specified word list. The word lists aren’t the traditional financial word lists that are human curated, but are word lists that are generated from word embeddings that are close to the concepts that are being scored. Therefore, they may also contain words that don’t obviously relate to a concept (e.g., risk), but their occurrence implies the presence of discussion related to the concept. You can even bring your own word lists to quantify additional concepts (for example, ESG). This API call is shown in the following code:
This generates an extended DataFrame.
Instead of training for binary classification as we did earlier, we can use the seven rating classes in the dataset for multicategory classification. The details of training the model on this extended DataFrame are provided in the training notebook of the Corporate Credit Rating Prediction solution in SageMaker JumpStart. The final performance on a sample of the data is shown in the confusion matrix.
We can observe that the trained model is accurate on the test dataset, even though we trained it on a small subset of the data.
SageMaker makes it simple to deploy the model to an endpoint. As we discussed, you can then use this for inference, and the technical details (a few lines of code) are also shown in the training and inference notebooks that come with this solution.
About the Authors
Dr. Sanjiv Das is an Amazon Scholar and the Terry Professor of Finance and Data Science at Santa Clara University. He holds post-graduate degrees in Finance (M.Phil and Ph.D. from New York University) and Computer Science (M.S. from UC Berkeley), and an MBA from the Indian Institute of Management, Ahmedabad. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice-President at Citibank. He works on multimodal machine learning in the area of financial applications.
Dr. John He is a senior software development engineer with Amazon AI, where he focuses on machine learning and distributed computing. He holds a PhD degree from CMU.
Shenghua Yue is a Software Development Engineer at Amazon SageMaker. She focuses on building machine learning tools and products for customers.