
    The Complete Guide to Information Extraction from Texts with Spark NLP and Python


    Extract Hidden Insights from Texts at Scale with Spark NLP

    Information extraction in natural language processing (NLP) is the process of automatically extracting structured information from unstructured text data. Spark NLP has many solutions for identifying specific entities from large volumes of text data, and converting them into a structured format that can be analyzed and used for subsequent applications.

    Information extraction is a subfield of NLP that involves the automated identification and extraction of structured information from unstructured or semi-structured text data. The goal of information extraction is to transform unstructured text data into structured data that can be easily analyzed, searched, and visualized.

    Roughly 2.5 quintillion bytes of data (2.5 × 10^18, or 2.5 billion billion bytes) are generated every day, ranging from social media posts to scientific literature, so information extraction has become an essential tool for processing this data. As a result, many organizations rely on information extraction techniques backed by solid NLP algorithms to automate manual tasks.

    Information extraction involves identifying specific entities, relationships, and events of interest in text data, such as named entities like people, organizations, dates, and locations, and converting them into a structured format that can be analyzed and used for various applications. It is a challenging task that requires the use of various techniques, including named entity recognition (NER), regular expressions, and text matching. Its applications range from search engines and chatbots to recommendation systems and fraud detection, making it a vital tool in the field of NLP.

    TextMatcher and BigTextMatcher are annotators that are used to match and extract text patterns from a document. The difference between them is that BigTextMatcher is designed for large corpora.

    TextMatcher works by defining a set of rules that specify the patterns to match and how to match them. Once the rules are defined, TextMatcher can be used to extract the matched text patterns from the dataset.

    In this post, you will learn how to use Spark NLP to perform information extraction efficiently. We will discuss identifying keywords or phrases in text data that correspond to specific entities or events of interest, using the TextMatcher and BigTextMatcher annotators of the Spark NLP library.

    Let us start with a short Spark NLP introduction and then discuss the details of the information extraction techniques with some solid results.

    Introduction to Spark NLP

    Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.

    Since its first release in July 2017, Spark NLP has grown into a full-fledged NLP library, providing:

    • A single unified solution for all your NLP needs (whether clinical, biomedical, financial, etc.)
    • Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
    • The most widely used NLP library for Java and Python in industry (5 years in a row)
    • The most scalable, accurate and fastest library in NLP history

    Spark NLP comes with 14,500+ pretrained pipelines and models in 250+ languages. It supports most common NLP tasks and provides modules that can be used seamlessly in a cluster.

    Spark NLP processes the data using pipelines, a structure that contains all the steps to be run on the input data:

    NLP pipeline architecture.

    Each step contains an annotator that performs a specific task, such as tokenization, normalization, or dependency parsing. Each annotator takes one or more annotations as input and outputs a new annotation.

    An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.

    Setup

    To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

    !pip install spark-nlp
    !pip install pyspark

    For other installation options for different environments and machines, please check the official documentation.

    Then, simply import the library and start a Spark session:

    import sparknlp
    
    # Start Spark Session
    spark = sparknlp.start()
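
    As a quick sanity check, you can print the versions that are now active; this is optional, but handy when debugging environment issues:

    # Verify that the session is up and check the library versions
    print("Spark NLP version:", sparknlp.version())
    print("Apache Spark version:", spark.version)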

    TextMatcher

    TextMatcher is a Spark NLP annotator that matches exact phrases (token by token) provided in a file against a document.

    TextMatcher expects DOCUMENT and TOKEN annotations as input and outputs CHUNK annotations. Here’s an example of how to use TextMatcher in Spark NLP to extract Person and Location entities from unstructured text. First, let’s create a dataframe from a sample text:

    # Create a dataframe from the sample text
    sample_text = """As she traveled across the world, Emma visited many different places
    and met many fascinating people. She walked the busy streets of Tokyo,
    hiked the rugged mountains of Nepal, and swam in the crystal-clear waters
    of the Caribbean. Along the way, she befriended locals like Akira, Rajesh,
    and Maria, each with their own unique stories to tell. Emma's travels took her
    to many cities, including New York, Paris, and Hong Kong, where she savored
    delicious foods and explored vibrant cultures. No matter where she went,
    Emma always found new wonders to discover and memories to cherish."""
    
    data = spark.createDataFrame([[sample_text]]).toDF("text")

    Let’s define the names and locations that we seek to match and save them as text files:

    # PERSON
    person_matches = """
    Emma
    Akira
    Rajesh
    Maria
    """
    
    with open('person_matches.txt', 'w') as f:
        f.write(person_matches)
    
    # LOCATION
    location_matches = """
    Tokyo
    Nepal
    Caribbean
    New York
    Paris
    Hong Kong
    """
    
    with open('location_matches.txt', 'w') as f:
        f.write(location_matches)

    Spark NLP uses pipelines; below is the short pipeline that we will use to match the entities in the sample text. Notice that the setCaseSensitive parameter is set to False to allow matching of both uppercase and lowercase versions of the words:

    # Import the required modules and classes
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        Tokenizer,
        TextMatcher
    )
    from sparknlp.common import ReadAs
    import pyspark.sql.functions as F
    
    # Step 1: Transforms raw texts to `document` annotation
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")
    
    # Step 2: Gets the tokens of the text    
    tokenizer = Tokenizer() \
        .setInputCols("document") \
        .setOutputCol("token")
    
    # Step 3: PERSON matcher
    person_extractor = TextMatcher() \
        .setInputCols("document", "token") \
        .setEntities("person_matches.txt", ReadAs.TEXT) \
        .setEntityValue("PERSON") \
        .setOutputCol("person_entity") \
        .setCaseSensitive(False)
    
    # Step 4: LOCATION matcher
    location_extractor = TextMatcher() \
        .setInputCols("document", "token") \
        .setEntities("location_matches.txt", ReadAs.TEXT) \
        .setEntityValue("LOCATION") \
        .setOutputCol("location_entity") \
        .setCaseSensitive(False)
    
    
    pipeline = Pipeline().setStages([document_assembler,
                                     tokenizer,
                                     person_extractor,
                                     location_extractor
                                     ])

    We fit and transform the dataframe (data) to get the results for the Person entities:

    # Fit and transform to get a prediction
    results = pipeline.fit(data).transform(data)
    
    # Display the results
    results.selectExpr("person_entity.result").show(truncate=False)

    Now, display the results for the Location entities (the pipeline has already been fitted, so we only need to select the other output column):

    results.selectExpr("location_entity.result").show(truncate=False)

    In this example, we created two TextMatcher stages: one matches person names, and the other matches locations. Once the Spark NLP pipeline is applied to the sample text, any tokens that match the specified entries are extracted.
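
    For quick experiments on individual strings, you can also wrap the fitted pipeline in a LightPipeline, which runs in memory without a Spark DataFrame. Here is a minimal sketch; the sample sentence is our own, and the expected matches follow directly from the entity lists above:

    from sparknlp.base import LightPipeline
    
    # Wrap the fitted pipeline for fast, in-memory inference on small inputs
    light_model = LightPipeline(pipeline.fit(data))
    
    # annotate() returns a dict mapping each output column to its results
    annotations = light_model.annotate("Emma and Rajesh met in Paris.")
    print(annotations["person_entity"])    # expected: ['Emma', 'Rajesh']
    print(annotations["location_entity"])  # expected: ['Paris']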

    BigTextMatcher

    BigTextMatcher is an extension of the TextMatcher component in Spark NLP that allows users to match and extract patterns of text in very large documents or corpora. It is designed to handle datasets that are too large to fit in memory and provides an efficient way to perform distributed pattern matching and text extraction using Spark NLP.

    BigTextMatcher works by taking a set of words or phrases as input and building an efficient data structure to store them. This data structure allows the BigTextMatcher to quickly match the input text against the large set of words or phrases, making it much faster than the TextMatcher for large sets of data.

    Let’s work on an example similar to the TextMatcher one, extracting Person entities from a text. First, we have to create a dataframe from a sample text:

    # Create a dataframe from the sample text
    sample_text = """Natural language processing (NLP) has been a hot topic in recent years,
    with many prominent researchers and practitioners making significant
    contributions to the field. For example, the work of Karen Smith and
    John Johnson on sentiment analysis has helped businesses better understand
    customer feedback and improve their products and services. Meanwhile,
    the research of Dr. Jane Kim on neural machine translation has revolutionized
    the way we approach language translation. Other notable figures in the field
    include Michael Brown, who developed the popular Stanford NLP library, and
    Prof. Emily Zhang, whose work on text summarization has been widely cited
    in academic circles. With so many talented individuals pushing the boundaries
    of what's possible with NLP, it's an exciting time to be involved in this
    rapidly evolving field."""
    
    data = spark.createDataFrame([[sample_text]]).toDF("text")

    Now define the names that we want to match and save them as a file:

    # PERSON
    person_matches = """
    Karen Smith
    John Johnson
    Jane Kim
    Michael Brown
    Emily Zhang
    """
    
    with open('person_matches.txt', 'w') as f:
        f.write(person_matches)

    We will use the following pipeline:

    # Import the required modules and classes
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        Tokenizer,
        BigTextMatcher
    )
    from sparknlp.common import ReadAs
    import pyspark.sql.functions as F
    
    # Step 1: Transforms raw texts to `document` annotation
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")
    
    # Step 2: Gets the tokens of the text    
    tokenizer = Tokenizer() \
        .setInputCols("document") \
        .setOutputCol("token")
    
    # Step 3: PERSON matcher
    person_extractor = BigTextMatcher() \
        .setInputCols("document", "token") \
        .setStoragePath("person_matches.txt", ReadAs.TEXT) \
        .setOutputCol("person_entity") \
        .setCaseSensitive(False)
    
    pipeline = Pipeline().setStages([document_assembler,
                                     tokenizer,
                                     person_extractor])

    We fit and transform the dataframe (data) to get the results for the Person entities:

    # Fit and transform to get a prediction
    results = pipeline.fit(data).transform(data)
    
    # Display the results
    results.selectExpr("person_entity.result").show(truncate=False)

    In this example, we created a BigTextMatcher stage, which matches previously defined person names from the text. Once the Spark NLP pipeline is applied to the sample text, those specified words are extracted.
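
    Beyond the matched strings, each chunk annotation also carries its character offsets, which can be flattened with standard Spark SQL functions. The sketch below assumes Spark NLP's standard annotation schema (each annotation exposes result, begin, and end fields):

    import pyspark.sql.functions as F
    
    # Explode the chunk annotations into one row per match
    results.select(F.explode("person_entity").alias("chunk")) \
        .select(
            F.col("chunk.result").alias("matched_text"),
            F.col("chunk.begin").alias("begin"),
            F.col("chunk.end").alias("end")
        ).show(truncate=False)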


    Conclusion

    The TextMatcher and BigTextMatcher annotators of Spark NLP enable identifying and extracting previously defined words and phrases from texts. Text matching is particularly useful when there is no clear pattern in the text data that can be captured using regular expressions.

    Overall, text matching is an essential tool for anyone working in the field of NLP, and its importance is only likely to grow as the field continues to evolve and expand. By using text matching tools like TextMatcher and BigTextMatcher, data scientists and developers can save time and effort and create more accurate and effective NLP pipelines.
