
    The Complete Guide to Information Extraction from Texts with Spark NLP and Python


    Extract Hidden Insights from Texts at Scale with Spark NLP

    Information extraction in natural language processing (NLP) is the process of automatically extracting structured information from unstructured text data. Spark NLP has many solutions for identifying specific entities from large volumes of text data, and converting them into a structured format that can be analyzed and used for subsequent applications.

    Information extraction is a subfield of NLP that involves the automated identification and extraction of structured information from unstructured or semi-structured text data. The goal of information extraction is to transform unstructured text data into structured data that can be easily analyzed, searched, and visualized.

    Roughly 2.5 quintillion bytes of data (2.5 × 10^18, or 2.5 billion billion bytes) are generated every day, ranging from social media posts to scientific literature, so information extraction has become an essential tool for processing this flood of text. As a result, many organizations rely on information extraction techniques backed by solid NLP algorithms to automate what were once manual tasks.

    Information extraction involves identifying specific entities, relationships, and events of interest in text data, such as named entities like people, organizations, dates, and locations, and converting them into a structured format that can be analyzed and used for various applications. It is a challenging task that requires various techniques, including named entity recognition (NER), regular expressions, and text matching. Its applications range from search engines and chatbots to recommendation systems and fraud detection, making it a vital tool in the field of NLP.

    TextMatcher and BigTextMatcher are annotators used to match and extract text patterns from a document. The difference between them is that BigTextMatcher is designed for large corpora.

    TextMatcher works by defining a set of rules that specify the patterns to match and how to match them. Once the rules are defined, TextMatcher can be used to extract the matched text patterns from the dataset.

    In this post, you will learn how to use Spark NLP to perform information extraction efficiently. We will discuss how to identify keywords or phrases in text data that correspond to specific entities or events of interest using the TextMatcher and BigTextMatcher annotators of the Spark NLP library.

    Let us start with a short Spark NLP introduction and then discuss the details of the information extraction techniques with some solid results.

    Introduction to Spark NLP

    Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.

    Since its first release in July 2017, Spark NLP has grown into a full-fledged NLP tool, providing:

    • A single unified solution for all your NLP needs (clinical, biomedical, financial, etc.)
    • Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
    • The most widely used NLP library for Java and Python in industry (5 years in a row)
    • The most scalable, accurate and fastest library in NLP history

    Spark NLP comes with 14,500+ pretrained pipelines and models in 250+ languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.
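    For example, a published pretrained pipeline can be downloaded and applied in a couple of lines. Here is a minimal sketch, assuming the well-known explain_document_dl English pipeline from the models hub:

    # Download a published English pipeline and run it on a plain string
    from sparknlp.pretrained import PretrainedPipeline
    
    explain_pipeline = PretrainedPipeline("explain_document_dl", lang="en")
    annotations = explain_pipeline.annotate("Emma visited Tokyo in 2023.")
    
    # The "entities" key holds the named entity chunks found by the pipeline
    print(annotations["entities"])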

    Spark NLP processes the data using Pipelines, structures that contain all the steps to be run on the input data:

    NLP pipeline architecture.

    Each step contains an annotator that performs a specific task, such as tokenization, normalization, or dependency parsing. Each annotator takes one or more annotations as input and outputs a new annotation.

    An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.
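    To make this concrete, here is a minimal sketch that runs only the DocumentAssembler (using the Spark session created in the Setup section below) and prints the annotation it produces:

    from sparknlp.base import DocumentAssembler
    
    # A one-row dataframe with a raw text column
    df = spark.createDataFrame([["Emma visited Tokyo."]]).toDF("text")
    
    # DocumentAssembler is a transformer, so it can be applied directly
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")
    
    # Each annotation is a struct with an annotator type, begin/end character
    # offsets, the result text, and a metadata map
    document_assembler.transform(df).select("document").show(truncate=False)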

    Setup

    To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

    !pip install spark-nlp
    !pip install pyspark

    For other installation options for different environments and machines, please check the official documentation.

    Then, simply import the library and start a Spark session:

    import sparknlp
    
    # Start Spark Session
    spark = sparknlp.start()
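    To verify that everything is up and running, you can print the library and Spark versions:

    # Check the installed versions
    print("Spark NLP version:", sparknlp.version())
    print("Apache Spark version:", spark.version)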

    TextMatcher

    TextMatcher is a Spark NLP annotator that matches exact phrases (provided in a file) against a document, token by token.

    TextMatcher expects DOCUMENT and TOKEN annotations as input and produces CHUNK annotations as output. Here’s an example of how to use TextMatcher in Spark NLP to extract Person and Location entities from unstructured text. First, let’s create a dataframe with a sample text:

    # Create a dataframe from the sample_text
    data = spark.createDataFrame([[
        "As she traveled across the world, Emma visited many different places "
        "and met many fascinating people. She walked the busy streets of Tokyo, "
        "hiked the rugged mountains of Nepal, and swam in the crystal-clear waters "
        "of the Caribbean. Along the way, she befriended locals like Akira, Rajesh, "
        "and Maria, each with their own unique stories to tell. Emma's travels took her "
        "to many cities, including New York, Paris, and Hong Kong, where she savored "
        "delicious foods and explored vibrant cultures. No matter where she went, "
        "Emma always found new wonders to discover and memories to cherish."
    ]]).toDF("text")

    Let’s define the names and locations that we seek to match and save them as text files:

    # PERSON
    person_matches = """
    Emma
    Akira
    Rajesh
    Maria
    """
    
    with open('person_matches.txt', 'w') as f:
        f.write(person_matches)
    
    # LOCATION
    location_matches = """
    Tokyo
    Nepal
    Caribbean
    New York
    Paris
    Hong Kong
    """
    
    with open('location_matches.txt', 'w') as f:
        f.write(location_matches)

    Spark NLP uses pipelines; below is the short pipeline that we will use to match the entities in the sample text. Notice that the setCaseSensitive parameter is set to False to allow matching of both uppercase and lowercase versions of the words:

    # Import the required modules and classes
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        Tokenizer,
        TextMatcher
    )
    from sparknlp.common import ReadAs
    from pyspark.ml import Pipeline
    
    # Step 1: Transforms raw texts to `document` annotation
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")
    
    # Step 2: Gets the tokens of the text    
    tokenizer = Tokenizer() \
        .setInputCols("document") \
        .setOutputCol("token")
    
    # Step 3: PERSON matcher
    person_extractor = TextMatcher() \
        .setInputCols("document", "token") \
        .setEntities("person_matches.txt", ReadAs.TEXT) \
        .setEntityValue("PERSON") \
        .setOutputCol("person_entity") \
        .setCaseSensitive(False)
    
    # Step 4: LOCATION matcher
    location_extractor = TextMatcher() \
        .setInputCols("document", "token") \
        .setEntities("location_matches.txt", ReadAs.TEXT) \
        .setEntityValue("LOCATION") \
        .setOutputCol("location_entity") \
        .setCaseSensitive(False)
    
    
    pipeline = Pipeline().setStages([document_assembler,
                                     tokenizer,
                                     person_extractor,
                                     location_extractor
                                     ])

    We fit and transform the dataframe (data) to get the results for the Person entities:

    # Fit and transform to get a prediction
    results = pipeline.fit(data).transform(data)
    
    # Display the results
    results.selectExpr("person_entity.result").show(truncate=False)

    Similarly, display the results for the Location entities:

    results.selectExpr("location_entity.result").show(truncate=False)

    In this example, we created two TextMatcher stages: one matches person names and the other matches locations. Once the Spark NLP pipeline is applied to the sample text, any words that match the specified entries are extracted.
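    Each match is returned as a chunk annotation that also carries its character offsets, so you can inspect more than just the matched strings. A minimal sketch using standard PySpark functions:

    import pyspark.sql.functions as F
    
    # Explode the person chunks into one row per match and keep the
    # matched text together with its character offsets
    results.select(F.explode("person_entity").alias("ent")) \
        .select(
            F.col("ent.result").alias("chunk"),
            F.col("ent.begin").alias("begin"),
            F.col("ent.end").alias("end")
        ).show(truncate=False)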

    BigTextMatcher

    BigTextMatcher is an extension of the TextMatcher component in Spark NLP that allows users to match and extract patterns of text in very large documents or corpora. It is designed to handle datasets that are too large to fit in memory and provides an efficient way to perform distributed pattern matching and text extraction using Spark NLP.

    BigTextMatcher works by taking a set of words or phrases as input and building an efficient data structure to store them. This data structure allows the BigTextMatcher to quickly match the input text against the large set of words or phrases, making it much faster than the TextMatcher for large sets of data.

    Let’s work through an example similar to the TextMatcher one, extracting person entities from a text. First, we have to create a dataframe from a sample text:

    # Create a dataframe from the sample_text
    data = spark.createDataFrame([[
        "Natural language processing (NLP) has been a hot topic in recent years, "
        "with many prominent researchers and practitioners making significant "
        "contributions to the field. For example, the work of Karen Smith and "
        "John Johnson on sentiment analysis has helped businesses better understand "
        "customer feedback and improve their products and services. Meanwhile, "
        "the research of Dr. Jane Kim on neural machine translation has revolutionized "
        "the way we approach language translation. Other notable figures in the field "
        "include Michael Brown, who developed the popular Stanford NLP library, and "
        "Prof. Emily Zhang, whose work on text summarization has been widely cited "
        "in academic circles. With so many talented individuals pushing the boundaries "
        "of what's possible with NLP, it's an exciting time to be involved in this "
        "rapidly evolving field."
    ]]).toDF("text")

    Now define the names that we want to match and save them as a file:

    # PERSON
    person_matches = """
    Karen Smith
    John Johnson
    Jane Kim
    Michael Brown
    Emily Zhang
    """
    
    with open('person_matches.txt', 'w') as f:
        f.write(person_matches)

    We will use the following pipeline:

    # Import the required modules and classes
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        Tokenizer,
        BigTextMatcher
    )
    from sparknlp.common import ReadAs
    from pyspark.ml import Pipeline
    
    # Step 1: Transforms raw texts to `document` annotation
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")
    
    # Step 2: Gets the tokens of the text    
    tokenizer = Tokenizer() \
        .setInputCols("document") \
        .setOutputCol("token")
    
    # Step 3: PERSON matcher
    person_extractor = BigTextMatcher() \
        .setInputCols("document", "token") \
        .setStoragePath("person_matches.txt", ReadAs.TEXT) \
        .setOutputCol("person_entity") \
        .setCaseSensitive(False)
    
    pipeline = Pipeline().setStages([document_assembler,
                                     tokenizer,
                                     person_extractor])

    We fit and transform the dataframe (data) to get the results for the Person entities:

    # Fit and transform to get a prediction
    results = pipeline.fit(data).transform(data)
    
    # Display the results
    results.selectExpr("person_entity.result").show(truncate=False)

    In this example, we created a BigTextMatcher stage, which matches the previously defined person names in the text. Once the Spark NLP pipeline is applied to the sample text, the specified names are extracted.
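    As a side note, if you want to run the same fitted pipeline on a few ad hoc strings without building a dataframe, Spark NLP's LightPipeline offers a convenient shortcut. A minimal sketch (the sample sentence is made up for illustration):

    from sparknlp.base import LightPipeline
    
    # Wrap the fitted pipeline for fast, local inference on plain strings
    light_model = LightPipeline(pipeline.fit(data))
    annotations = light_model.annotate("Karen Smith met Emily Zhang at a conference.")
    
    # The keys of the returned dict correspond to the pipeline's output columns
    print(annotations["person_entity"])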


    Conclusion

    Spark NLP’s TextMatcher and BigTextMatcher annotators make it possible to identify and extract previously defined words and phrases from texts. Text matching is particularly useful when there is no clear pattern in the text data that can be captured using regular expressions.

    Overall, text matching is an essential tool for anyone working in the field of NLP, and its importance is only likely to grow as the field continues to evolve and expand. By using text matching tools like TextMatcher and BigTextMatcher, data scientists and developers can save time and effort and create more accurate and effective NLP pipelines.
