INDUSTRY: Finance
Introduction: In recent years, traditional factors in financial markets, such as growth vs. value, market capitalization, credit rating, and stock price volatility, have become less predictive, prompting investors to explore new data sources such as news, images, and social network content. The goal of this project is to build an instrument that classifies companies into the market segments in which they operate.
The classification is based on the Thomson Reuters Business Classification (TRBC) taxonomy. In this white paper, we describe a Natural Language Processing (NLP) approach using Spark NLP and semantic techniques to assist domain experts in classifying documents with market labels.
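As a rough illustration of the approach, the sketch below wires a minimal Spark NLP training pipeline: documents are embedded with a pretrained Universal Sentence Encoder and fed to MultiClassifierDLApproach, which supports multilabel targets. The column names, the train_df DataFrame, and the epoch count are illustrative assumptions, not the project's actual configuration.

    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import UniversalSentenceEncoder, MultiClassifierDLApproach
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    # Assumed schema: a "text" column with the document body and a
    # "labels" column holding an array of market tags per document.
    document = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    embeddings = UniversalSentenceEncoder.pretrained() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence_embeddings")

    classifier = MultiClassifierDLApproach() \
        .setInputCols(["sentence_embeddings"]) \
        .setOutputCol("market_tags") \
        .setLabelColumn("labels") \
        .setMaxEpochs(10)

    pipeline = Pipeline(stages=[document, embeddings, classifier])
    model = pipeline.fit(train_df)  # train_df is the labeled training split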
Challenge: Lack of reliable training data: Labeling data for the market classification task is a complicated, time-consuming manual process that introduces a lot of bias. Tagging texts is highly subjective to a person's perspective, and in the market taxonomy some tags have very close meanings and can be mislabeled not only by a model but by a human annotator as well.
Errors during batch labeling: Due to the limited resources and time frame of the project, it was not feasible to review every text in the dataset precisely during labeling. Labeling was therefore done based on keywords in the text entity. This approach introduces errors into the dataset and may produce unreliable results.
Low-quality data: Another challenge for this project is dealing with markets that have few tagged entities and with entities whose body is corrupted. A model trained on such data will not perform with the expected accuracy.
Need for large training data: The taxonomy requires a large amount of training data, as it contains 61 categories with a minimum requirement of 100 texts per category.
Dynamic market taxonomy: New companies, as well as new markets, come into existence every day; this may create false negative labels for some entities.
Solution: To overcome the lack of reliable training and labeling data, we created labeling guidance for the annotator team, partially labeled the dataset, and trained the models on it.
To address the errors made during batch labeling, we implemented an outlier detection algorithm that filters out the most distant texts based on the center of the cluster they belong to, which represents the average semantics of each market.
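A minimal sketch of this filtering step, assuming document embeddings are already available as a NumPy array (for example, the Universal Sentence Encoder vectors from the pipeline above) and assuming that the most distant 5% of texts per market are discarded; the names and the quantile cutoff are illustrative:

    import numpy as np

    def filter_outliers(embeddings, labels, keep_quantile=0.95):
        """Drop the texts most distant from their market's centroid."""
        embeddings = np.asarray(embeddings, dtype=float)
        labels = np.asarray(labels)
        keep = np.ones(len(labels), dtype=bool)
        for market in np.unique(labels):
            idx = np.where(labels == market)[0]
            # The centroid stands in for the average semantics of the market.
            centroid = embeddings[idx].mean(axis=0)
            dists = np.linalg.norm(embeddings[idx] - centroid, axis=1)
            cutoff = np.quantile(dists, keep_quantile)
            keep[idx[dists > cutoff]] = False  # discard the most distant texts
        return keep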
As the solution for low-quality data, we removed short texts and texts with a corrupted body during data preprocessing. Additionally, we analyzed the model's wrong predictions to verify that the annotators had labeled the original texts correctly. To make sure we were not overfitting the model, the random state of the train/test/validation split was changed on every data regeneration.
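A sketch of this preprocessing under assumed thresholds (a 200-character minimum and a simple control-character test for corrupted bodies) and an assumed raw_df DataFrame; the per-regeneration seed varies the split each time:

    import time
    from pyspark.sql import functions as F

    MIN_CHARS = 200  # assumed cutoff for "short" texts

    clean_df = (
        raw_df
        .filter(F.length("text") >= MIN_CHARS)  # drop short texts
        # drop bodies containing control characters, a crude corruption test
        .filter(~F.col("text").rlike(r"[\x00-\x08\x0B\x0C\x0E-\x1F]"))
    )

    # Changing the seed on every data regeneration produces a different
    # split, guarding against overfitting to one particular partition.
    seed = int(time.time())
    train_df, val_df, test_df = clean_df.randomSplit([0.8, 0.1, 0.1], seed=seed)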
To partially automate this process, we added a semi-supervised algorithm that automatically labels texts falling within the cluster borders of each market (see the sketch after the next step).
Finally, we added post-processing to the models to extract entities that are not relevant to any of the existing markets. Such examples should be processed manually, and the model's taxonomy should be extended if required.
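The sketch below combines the two previous steps: a text is auto-labeled when it falls inside the nearest market's cluster border, and queued for manual review (and possible taxonomy extension) when it lies outside every border. The centroids and radii inputs, learned from the labeled data, are assumptions of this illustration rather than the project's actual API:

    import numpy as np

    def pseudo_label(embeddings, centroids, radii):
        """Auto-label texts inside a cluster border; flag the rest.

        `centroids` maps market -> centroid vector; `radii` maps
        market -> the distance threshold defining that cluster's border.
        """
        labels, review_queue = [], []
        for i, vec in enumerate(np.asarray(embeddings, dtype=float)):
            dists = {m: np.linalg.norm(vec - c) for m, c in centroids.items()}
            nearest = min(dists, key=dists.get)
            if dists[nearest] <= radii[nearest]:
                labels.append(nearest)   # inside the border: auto-label
            else:
                labels.append(None)      # outside every market: manual review
                review_queue.append(i)
        return labels, review_queue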
Result: We implemented a multilabel classifier for the TRBC taxonomy at the level of market tags. The key components of the project include, but are not limited to, advanced data preprocessing, training on a big data set using distributed libraries, releasing the model to the production environment, and integrating it into the customer's IT infrastructure. This implementation allows the customer's analytics team to control model performance and gives them easy access to the model quality metrics.
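As one way to expose such quality metrics, the sketch below computes standard multilabel scores with scikit-learn; y_true and y_pred stand in for per-text tag sets collected from the model's predictions and are illustrative names, not the customer-facing tooling itself:

    from sklearn.metrics import f1_score, hamming_loss
    from sklearn.preprocessing import MultiLabelBinarizer

    # y_true / y_pred: lists of market-tag sets per text, e.g. gathered
    # from model.transform(test_df) and collected to the driver.
    mlb = MultiLabelBinarizer()
    true_bin = mlb.fit_transform(y_true)
    pred_bin = mlb.transform(y_pred)

    print("micro F1:", f1_score(true_bin, pred_bin, average="micro"))
    print("macro F1:", f1_score(true_bin, pred_bin, average="macro"))
    print("Hamming loss:", hamming_loss(true_bin, pred_bin))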