This blog is a comprehensive overview of methods for extracting structured text from SEC forms (10-Q, 10-K, and 8-K) using OCR, with the goal of automating manual data entry.
There’s a growing need in the financial markets for faster access to the financial information that supports trading decisions. Investors see these filings as crucial for understanding the financial status of companies and for avoiding fraud. However, searching through all the voluminous documents filed with the SEC to find the information an investor needs is hard. To save time, we can rely on intelligent systems to extract and store the required data from these financial documents. In this blog, we’ll learn how to parse and extract information from SEC forms, and look at the different techniques used to achieve this, including Optical Character Recognition (OCR) and deep learning. See the table of contents below:
- SEC Filings-All About 10Q 10K and 8K Forms
- Information Extraction from SEC Forms
- Available Datasets and Annotations for IE
- Popular Deep Learning Architectures
SEC Filings-All About 10Q 10K and 8K Forms
The U.S. Securities and Exchange Commission (SEC) is an independent federal government agency responsible for protecting investors, maintaining fair and orderly functioning of the securities markets, and facilitating capital formation. To achieve this, it requires all public companies, company insiders, and broker-dealers to file periodic financial statements. Based on these documents, investors can review a company’s profile and its activities. Now, let’s look at some of the most common forms that companies are required to submit to the SEC.
The 10-K form consists of the company’s annual report and a comprehensive analysis of its financial condition. Below are some of the fields that give a quick overview of a company; it is essential to extract and store such information from these forms.
- Company Name
- Company Address
- Employer Identification Number
- Common Stock List
- Current Share Values
- Products List
- Security Companies List
- Balance Sheet Tables
- Income Sheet Tables
- Cash Flow Statements
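For illustration, the fields listed above could be collected into a small, typed record once extracted. The schema below is our own sketch for this blog, not an official SEC data structure; field names are chosen to mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative schema for fields extracted from a 10-K filing.
# The names mirror the list above; this is NOT an official SEC schema.
@dataclass
class TenKExtraction:
    company_name: str = ""
    company_address: str = ""
    employer_id_number: str = ""          # EIN, e.g. "12-3456789"
    common_stock: List[str] = field(default_factory=list)
    balance_sheet: Dict[str, float] = field(default_factory=dict)
    income_statement: Dict[str, float] = field(default_factory=dict)
    cash_flow: Dict[str, float] = field(default_factory=dict)

# Hypothetical usage with made-up values:
record = TenKExtraction(company_name="Example Corp",
                        employer_id_number="12-3456789")
record.balance_sheet["Total assets"] = 6775.0
```

Storing extracted values in a fixed schema like this makes it easy to validate downstream (e.g., check that an EIN was found for every filing) before loading the data into a database.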
These are some of the most commonly examined fields, but there are many others. All companies must file their 10-K forms within 60 to 90 days of the close of their fiscal year (the exact deadline depends on the filer category). The 10-K form seems large, right? That’s why the SEC introduced the 10-Q form, a truncated version of the 10-K. Let’s learn about it in the next section.
The 10-Q form is a quarterly, unaudited report. It is less comprehensive than the 10-K, but it still gives a regular picture of a company’s performance: companies are required to disclose relevant information about their financial position in each 10-Q. There is no filing after the fourth quarter, because that is when the 10-K is filed. Below is some of the essential information that can be extracted from 10-Q forms.
- Financial Position of Company in Tables
- Management Discussions
- Working Capital
- Amounts Used and Received
- Market Risks
To perform information extraction from these forms, one must be aware of both table extraction and key-value pair extraction as there are no specific templates.
The Form 8-K is what a company uses to disclose significant developments that occur between filings of the Form 10-K or Form 10-Q. Major organizational/company events that would necessitate the filing of a Form 8-K include bankruptcies or receiverships, material impairments, completion of acquisition or disposition of assets, and departures or appointments of executives.
In the next section, let’s dive into how we can extract information from these forms.
Information Extraction from SEC Forms
Unlike invoices or receipts, extracting information from SEC forms is quite a challenging task. As every company has its own way of representing this financial information in different tables and key-value pairs, we’ll have to make sure the OCR algorithms we use are robust and intelligent. Before learning more about these techniques, let’s understand what OCR is about.
OCR techniques are not new, but they have continuously evolved over time. One popular and widely used OCR engine is Tesseract, an open-source engine originally developed at Hewlett-Packard and now maintained by Google (it is written in C++, with Python wrappers such as pytesseract). However, even popular tools like Tesseract fail in some complex scenarios: they blindly extract text from a given image without any layout understanding or rules. Hence they need intelligent algorithms backing them; this is where deep learning comes into the picture.
Here’s an example 10Q form found online. The output shown below is what we would get if we use Tesseract to extract all the tables and important information:
$ 6,775 $ 8,575 Costs and expenses: Cost of revenues (including stock-based compensation expense of $6 and $49) 2,452 2,936 Research and development (including stock-based compensation expense of $191 and $237) 818 1,226 Sales and marketing (including stock-based compensation expense of $54 and $78) 607 1,026 General and administrative (including stock-based compensation expense of $40 and $68)
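A rough post-processing pass can recover some structure from flat output like this. The sketch below uses a regular expression to pull (line-item, numbers) pairs out of Tesseract-style text; it is purely illustrative, and real filings would need far more robust parsing.

```python
import re

# Match a capitalized line-item label, an optional parenthetical note,
# and two or more numeric columns, e.g.
# "Research and development (including ... $191 and $237) 818 1,226"
LINE_ITEM = re.compile(
    r"([A-Z][A-Za-z ]+?)\s*"      # label, e.g. "Research and development"
    r"(?:\([^)]*\))?\s*"          # optional parenthetical note
    r"((?:[\d,]+\s*){2,})"        # two or more numeric columns
)

def parse_line_items(text: str) -> dict:
    items = {}
    for label, nums in LINE_ITEM.findall(text):
        items[label.strip()] = [int(n.replace(",", "")) for n in nums.split()]
    return items

# A fragment of the Tesseract output shown above:
sample = ("Cost of revenues (including stock-based compensation expense of "
          "$6 and $49) 2,452 2,936 Research and development (including "
          "stock-based compensation expense of $191 and $237) 818 1,226")
items = parse_line_items(sample)
```

Hand-written rules like this break as soon as the layout changes, which is exactly why the rest of this post turns to learned models instead.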
This isn’t orderly or usable, right? As discussed, the core job of OCR is to extract all the text from a given document irrespective of template, layout, language, or fonts. But our goal is to pick out all the critical information, like customer name, form type, and financial details, from the SEC forms, and that isn’t handled by top OCR engines like Tesseract. Therefore, we rely on deep learning models trained on large datasets to learn these patterns. Let’s discuss them in the next section.
Available Datasets for SEC Forms
To build effective OCR and deep learning models, one must train them on consistent datasets. Currently, there are no great tools available online that can automatically extract information from any form. Therefore, after collecting the datasets, we’ll have to build a state-of-the-art deep learning model that does this job. First things first, let’s see how we can prepare a dataset.
As this is publicly available data, we can download it company by company or use the available checkpoints present in open-source projects.
- The SEC filings index is split into quarterly files since 1993 (1993-QTR1, 1993-QTR2…) and these can be found online here.
- We can use the python-edgar repository to download the SEC forms using its Python scripts.
- Several forms are publicly available in this link here.
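To make the first bullet concrete, the snippet below builds the URLs of EDGAR’s quarterly full-index files, which list every filing for that quarter. The URL pattern is EDGAR’s public full-index layout at the time of writing; verify it against the SEC site before relying on it.

```python
# EDGAR publishes quarterly index files under this base path; each
# form.idx file lists the filings for one quarter. Pattern may change,
# so treat this as a sketch rather than a guaranteed API.
BASE = "https://www.sec.gov/Archives/edgar/full-index"

def index_urls(start_year: int, end_year: int) -> list:
    """URLs of the quarterly form.idx files for the given year range."""
    return [
        f"{BASE}/{year}/QTR{q}/form.idx"
        for year in range(start_year, end_year + 1)
        for q in range(1, 5)
    ]
```

Downloading each of these index files and filtering the rows for form types 10-K, 10-Q, and 8-K gives a manifest of filings to fetch (remember that the SEC asks automated clients to send a descriptive User-Agent header).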
Once the datasets are downloaded, the next step is to use an annotator to label all the required information in the SEC forms. Using these annotation files, we can train the deep learning model. Here are links to some of the open-source annotation tools available on GitHub.
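Most annotators export labeled character spans, which then have to be converted into token-level tags before training a sequence model. The sketch below does this conversion to BIO tags; the span format `(start, end, label)` is a hypothetical annotator export, not any specific tool’s output.

```python
# Convert annotated character spans into token-level BIO tags.
# spans: list of (start, end, label) character offsets -- a hypothetical
# annotator export format used here for illustration.
def to_bio(text: str, spans: list):
    tokens, tags, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)   # locate token in original text
        end = start + len(tok)
        pos = end
        tag = "O"                      # default: outside any entity
        for s, e, label in spans:
            if start >= s and end <= e:
                # B- for the first token of a span, I- for the rest
                tag = ("B-" if start == s else "I-") + label
                break
        tokens.append(tok)
        tags.append(tag)
    return tokens, tags
```

The resulting `(tokens, tags)` pairs are the standard training format for NER-style models, including the architectures discussed in the next section.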
In the next section, let’s look at a few deep learning models we can use for Information Extraction.
Some Popular Deep Learning Architectures and NLP Techniques
There are two broad approaches to information extraction with deep learning: building models that learn from document images, and building models that learn from the extracted text.
Let’s dive into deep learning and understand how these algorithms identify key-value pairs from images or text. Especially for SEC forms, it’s essential to extract the data in the tables, as most of the information in SEC forms is presented in tabular format. Now, let’s review some popular deep learning architectures for scanned documents.
- LayoutLM: LayoutLM is an open-source project by Microsoft for document image understanding. The authors propose pre-training techniques and models heavily inspired by NLP. The idea behind LayoutLM is to jointly model the interactions between text and layout information across the scanned document image. This is beneficial for a large number of real-world document image understanding tasks, such as information extraction from scanned documents. We can train models like these on the EDGAR datasets to build intelligent algorithms for key-value extraction tasks.
- CUTIE (Learning to Understand Documents with Convolutional Universal Text Information Extractor): In this research, Xiaohui Zhao et al. propose extracting key information from documents like receipts or invoices and converting the texts of interest into structured data. The heart of this research is a convolutional neural network applied to texts, where texts are embedded as features with semantic connotations. The model is trained on 4,484 labeled receipts and achieves average precision of 90.8% and 77.7% on taxi receipts and entertainment receipts, respectively.
- Named Entity Recognition: Named Entity Recognition allows us to evaluate a chunk of text and find out different entities from it – entities that don’t just correspond to a category of a token but apply to variable lengths of phrases. The models take into consideration the start and end of every relevant phrase according to the classification categories the model is trained for. Therefore, for SEC documents, we can train NER models to perform key-value pair extraction. To learn more about NER, read our blog here.
- BERTgrid: BERTgrid is a deep learning model for understanding generic documents and performing key-value pair extraction. It represents a document as a two-dimensional grid of contextualized BERT embeddings and runs a fully convolutional network over it, in the spirit of semantic instance segmentation. Overall, the reported mean accuracy on the selected document header and line-item fields is 65.48%.
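The core idea that CUTIE and BERTgrid share is placing each token’s feature vector into a 2D grid cell derived from its position on the page, so a convolutional network can see text and layout together. The sketch below illustrates just that gridding step, using vocabulary ids in place of learned embeddings; the box format `(x, y, token)` is our own simplification.

```python
# Sketch of the grid construction shared by CUTIE and BERTgrid:
# map each token to a grid cell based on its page position. Real models
# store an embedding vector per cell; we store a vocabulary id (0 = empty)
# to keep the illustration self-contained.
def build_grid(tokens, page_w, page_h, rows=4, cols=4):
    vocab = {}
    grid = [[0] * cols for _ in range(rows)]
    for x, y, tok in tokens:
        tok_id = vocab.setdefault(tok, len(vocab) + 1)
        r = min(int(y / page_h * rows), rows - 1)   # row from y position
        c = min(int(x / page_w * cols), cols - 1)   # column from x position
        grid[r][c] = tok_id
    return grid, vocab
```

Because nearby tokens land in nearby cells, a CNN over this grid can learn that, say, a number to the right of the words “Total assets” is a balance-sheet value, which plain text-sequence models cannot express.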
In DeepDeSRT, Schreiber et al. present an end-to-end system for table understanding in document images. The system contains two subsequent models, one for table detection and one for structured data extraction from the recognized tables. It outperformed state-of-the-art methods by achieving F1-measures of 96.77% and 91.44% for table detection and structure recognition, respectively. Models like these can be used specifically to extract values from the tables in SEC filings.
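To make the second stage’s goal concrete: once a detector has localized a table, the word boxes inside it still have to be grouped into rows and columns. A minimal version of that grouping, clustering boxes by vertical position and ordering each row left to right, might look like this (the `(x, y, text)` box format and tolerance are our assumptions, not DeepDeSRT’s actual implementation):

```python
# Group word boxes inside a detected table into rows: boxes whose y
# coordinates are within row_tol of each other belong to the same row,
# and each row is ordered by x. Box format (x, y, text) is illustrative.
def boxes_to_rows(boxes, row_tol=10):
    rows = []
    for x, y, text in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(y - rows[-1][0]) <= row_tol:
            rows[-1][1].append((x, text))   # same row: collect the cell
        else:
            rows.append((y, [(x, text)]))   # start a new row
    return [[t for _, t in sorted(cells)] for _, cells in rows]
```

A learned structure-recognition model replaces the fixed tolerance with predictions that survive skew, merged cells, and multi-line cells, but the output format, a list of rows of cell texts, is the same.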