Why You (or your analysts) Shouldn’t Be Using Ctrl-F to Search Through Annual Reports
April 29, 2020 by Wian
Have you ever found yourself depending on Ctrl-F as you crawled through news, financial and company reports? Currently, it is not uncommon to see analysts and portfolio managers using this simple tool to find what they are looking for in a mountain of reports. This is clearly a suboptimal strategy. Much of the time spent on searching through these documents should be allocated to more valuable tasks.
Before discussing some solutions for these problems, it is important to understand why manually searching using Ctrl-F is ineffective in a lot of cases.
The Problem with Ctrl-F
For investment managers, continuous surveillance of the latest news is mandatory. However, if they need to stay on top of a large number of stocks, critical information can easily be missed from the overwhelming quantity of reports coming out, which could end up being very costly. Furthermore, in a world where time is so limited, knowing the major topics, companies, names, locations or figures that a report contains would allow analysts to prioritise more efficiently, instead of sifting through each one.
Today, most analysts search each report for a particular word using Ctrl-F. This ignores the context the word appears in and the multiple meanings it may have, so they have to sift tediously through many false positives. This is time-consuming even for short reports, and frustrating for long reports when hundreds of instances of the same word are found. Ctrl-F is more like a microscope, and shouldn’t be used to understand the landscape.
Sometimes the report may not contain the original term but a synonym for it. For example, the most common term used to report revenue is ‘Revenues’, but some companies report their revenue as ‘SalesRevenueNet’ or ‘SalesRevenueGoodsNet’. Other times, the report may contain both the word and the synonym. For example, current news uses coronavirus and COVID-19 interchangeably. Because of this, a search for any single term will inevitably miss key pieces of information.
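The synonym problem can be illustrated with a minimal sketch. The revenue terms come from the examples above; the `SYNONYMS` table, both function names, and the sample sentence are assumptions made up for illustration. A literal search mimics Ctrl-F, while a concept search tries every known synonym.

```python
import re

# Hypothetical synonym table; a real one would have to be curated by hand.
SYNONYMS = {
    "revenue": ["Revenues", "SalesRevenueNet", "SalesRevenueGoodsNet"],
    "coronavirus": ["coronavirus", "COVID-19", "COVID19"],
}

def literal_hits(text, term):
    """Ctrl-F style: positions where the exact term occurs (case-insensitive)."""
    return [m.start() for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE)]

def concept_hits(text, concept):
    """Match any listed synonym for the concept and return the matched strings."""
    pattern = "|".join(re.escape(term) for term in SYNONYMS[concept])
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

# Made-up sentence, not taken from any real report.
sample = "The outbreak of COVID-19 reduced SalesRevenueNet in the quarter."

print(literal_hits(sample, "coronavirus"))   # the literal term never appears
print(concept_hits(sample, "coronavirus"))   # the synonym is found
print(concept_hits(sample, "revenue"))
```

Even this only covers synonyms someone thought to list in advance, which is the gap that the context-aware techniques discussed in this article aim to close.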
A lack of contextual understanding and the absence of semantic awareness mean that relying solely on Ctrl-F is an unnecessary waste of time for you and your team. But it doesn’t have to be so. Modern Natural Language Processing (NLP) techniques provide tools that enable an analyst to skim through reports quickly and ensure nothing is missed. We discuss two of them below.
Named-Entity Recognition (NER) is a particular subtask of text information extraction, which aims to identify the named “entities” (which might include places, names, companies, figures, currencies, percentages) that lie in the text and to classify them. An NER model is able to pick up appropriate phrases, and not just a single word, providing additional context. For instance, in the extract below, the model discovers the phrase “fiscal 2020 first quarter”, rather than just “2020” or “first quarter” because it is clear from the context that these words ought to be grouped together. Contextual understanding is a feature of NER that Ctrl-F does not have, and so NER has an immediate advantage in that way.
In NER, the term “entities” is rather broad and can refer to many different predefined categories that have a physical or abstract existence. For example, below is the first paragraph from Apple’s press release earlier this year. The paragraph has been analysed using Amazon’s Standard NER model, and some of the relevant words or phrases have been highlighted.
Extracting these entities is not as easy as it might first appear. For instance, Apple the corporation must be differentiated from the fruit. NER is able to leverage the context that a word lies in to disambiguate its meaning, and then label the word appropriately.
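To make the idea of labelled, multi-word entities concrete, here is a deliberately simplified sketch. It uses hand-written regular expressions, not a trained NER model, so it cannot perform the kind of disambiguation described above (Apple the company versus the fruit). The patterns, labels, function name, and sample sentence are all assumptions made up for illustration.

```python
import re

# Hypothetical patterns; a trained NER model learns such cues from data instead.
PATTERNS = {
    "MONEY": r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion|trillion))?",
    "PERCENT": r"\d+(?:\.\d+)?\s?(?:%|percent)",
    "PERIOD": r"fiscal\s\d{4}\s(?:first|second|third|fourth)\squarter",
}

def extract_entities(text):
    """Return (label, phrase) pairs in the order they appear in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), label, match.group(0)))
    found.sort()  # order by position in the text
    return [(label, phrase) for _, label, phrase in found]

# Made-up sentence, not taken from any real press release.
sample = ("Acme Corp reported revenue of $12.5 billion for its "
          "fiscal 2020 first quarter, up 9 percent from a year earlier.")

for label, phrase in extract_entities(sample):
    print(f"{label:8} {phrase}")
```

The output is a list of whole phrases with a category attached to each, which is the shape of result an NER model returns, rather than a list of positions where a single word happens to occur.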
NER also leads naturally onto other methods to extract and organise information from unstructured text. For instance, you can combine NER with sentiment analysis (where text is given a sentiment rating by a model to determine how negative, positive, or neutral it is). Knowing at a glance the sentiment of the report, as well as which companies are mentioned, is a powerful combination. Ctrl-F doesn’t lead onto any other intelligent ways to analyse text.
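As a rough sketch of that combination, the snippet below attaches a sentiment score to each report alongside its entity list. The entity lists are assumed to come from an upstream NER step that is not shown, the word lists are a tiny hypothetical lexicon, and a production sentiment model would be considerably more sophisticated than counting words.

```python
import re

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"growth", "record", "strong", "beat", "improved"}
NEGATIVE = {"loss", "decline", "weak", "lawsuit", "missed"}

def sentiment_score(text):
    """Score in [-1, 1]: (positive - negative) / (positive + negative) word counts."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

def summarise(reports):
    """Pair each report's entities (from an assumed NER step) with its sentiment."""
    return [
        {
            "title": report["title"],
            "companies": report["entities"],
            "sentiment": sentiment_score(report["text"]),
        }
        for report in reports
    ]

# Made-up reports with pre-extracted entities.
reports = [
    {"title": "Report A", "entities": ["Acme Corp"],
     "text": "Record growth and strong demand."},
    {"title": "Report B", "entities": ["Globex"],
     "text": "Weak sales led to a loss."},
]

for row in summarise(reports):
    print(row)
```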
Unstructured text is rich with information and with a vast number of reports to read through, NER can be an invaluable tool to quickly extract and categorise the key names, companies, figures and locations so that an informed decision can be made about which reports to read. NER is able to use context, something not intrinsic to Ctrl-F, to find these entities. NER can also be used as a stepping stone to other forms of text processing, such as sentiment analysis.
In the case where there are a number of reports to read in a limited amount of time, topic modelling can be another useful tool. A topic modelling tool extracts the underlying topics, or themes, within a report. Then, having knowledge of the topics that are covered in a document, one can make smart choices about which documents to read, and which have duplicate or irrelevant information.
Topic modelling tackles the first issue associated with Ctrl-F: when there are a great many reports to read, it makes the analyst aware of the essential points within each article. This helps ensure that no important high-level information is missed before moving on to deeper research.
For example, topic modelling can be used to summarise news for a given day, before doing detailed research on identified topics. Ctrl-F would not fare well with such a task because, for a given day, there are hundreds of reports, each containing multiple synonyms for any given concept. Looking for individual words using Ctrl-F is hopeless and will lead to too much important information being missed.
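For readers curious about what happens under the hood, below is a toy sketch of one common topic-modelling algorithm, Latent Dirichlet Allocation (LDA), fitted with a collapsed Gibbs sampler. This article does not say which algorithm was used for the results discussed here, so treat LDA as an assumption. The function name, hyperparameter values, and four-document corpus are made up, and real work would normally rely on an established library rather than this pure-Python version.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            k = rng.randrange(n_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(topics)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    # Summarise: most frequent words per topic, and topic proportions per document.
    top_words = [
        [w for w, c in topic_word[t].most_common(5) if c > 0]
        for t in range(n_topics)
    ]
    doc_mix = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return top_words, doc_mix

# Made-up, pre-tokenised documents.
docs = [
    "bank franc rate currency bank".split(),
    "franc bank swiss rate".split(),
    "airport security screening agency".split(),
    "security airport passengers screening airport".split(),
]

top_words, doc_mix = lda_gibbs(docs, n_topics=2)
for t, words in enumerate(top_words):
    print(f"topic {t}: {words}")
```

The two outputs map onto the two uses described in this article: the top words per topic are what an analyst reads to judge what a topic is about, and the per-document proportions are what allow reports to be sorted by topic.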
As a real example, using Reuters wire reports for a random day, these are two of the topics (topic 11 and 16) returned after running topic modelling:
Clearly topic 11 (above) is discussing something to do with banking or the Franc, while topic 16 (below) is covering something related to an airport and security. Suppose topic 16 was of interest. Then the model can sort all of the day’s reports to find those with topic 16 as the most prevalent. Indeed, one of the articles that reports topic 16 as predominant has the headline: “Web of agencies at U.S. airports could hinder security overhauls”.
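The sorting step can be sketched as follows, assuming the topic model has already produced a list of topic weights for every report. The function names, headlines, and weights below are hypothetical placeholders, not the Reuters results.

```python
def dominant_topic(weights):
    """Index of the topic with the largest weight in one report."""
    return max(range(len(weights)), key=weights.__getitem__)

def rank_by_topic(doc_topics, topic_id, min_weight=0.0):
    """Sort reports by the weight of one topic, highest first."""
    ranked = sorted(
        doc_topics.items(),
        key=lambda item: item[1][topic_id],
        reverse=True,
    )
    return [
        (headline, weights[topic_id])
        for headline, weights in ranked
        if weights[topic_id] >= min_weight
    ]

# Hypothetical per-report topic weights for a three-topic model.
doc_topics = {
    "Report A": [0.7, 0.2, 0.1],
    "Report B": [0.1, 0.1, 0.8],
    "Report C": [0.2, 0.3, 0.5],
}

# Reports where topic 2 is most relevant come first.
for headline, weight in rank_by_topic(doc_topics, topic_id=2, min_weight=0.5):
    print(headline, weight)
```

The `min_weight` threshold is a design choice: raising it trims the reading list to reports that are mostly about the chosen topic, at the cost of dropping reports that mention it only in passing.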
Using topic modelling in cases like this makes it really easy to see what is covered in large volumes of reports, something unavailable with Ctrl-F. Displaying the key themes in a report, topic modelling is best used when there are too many articles to read through manually and where prioritisation is unavoidable. Topic modelling has substantial potential to help an analyst ensure that nothing is missed before going into more depth.
Using Ctrl-F, as discussed, is suboptimal when there are many reports to search through. Topic modelling and named entity recognition are useful tools for extracting information from text without having to manually search through an entire document. NER identifies the entities in each report and topic modelling the main topics discussed. This makes it quick and easy to find reports that are most relevant to the research and to segregate reports based on their topics.
Overall, the best strategy is to understand when to use topic modelling and NER versus Ctrl-F (and when to combine them). For example, the following cases would favour the use of Ctrl-F:
- if the document is very short
- if it is clear what the target word or phrase is
- if the word or phrase is considerably specialised (such as a medical term or a new company).
For most other cases, the use of NER and topic modelling is a beneficial addition. The power behind NER and topic modelling is that these techniques are able to use all the information within an article, and systematically present only the most useful and actionable information from this, making the research process more consistent. All of this saves colossal amounts of time and helps analysts focus on the type of work that they are best at - finding profitable investment opportunities.
Auquan is a data science solutions provider for asset managers and hedge funds. Our state of the art technology empowers Portfolio Managers to stay ahead of the trend and achieve better returns.