
    Coreference resolution with BERT-based Models


    See how BERT-based models in Spark NLP can effortlessly resolve coreference in your text data


    Coreference resolution is the task of identifying and linking all expressions within a text that refer to the same real-world entity, such as a person, object, or concept. Using Spark NLP, you can perform it at scale and power many NLP applications, including text understanding, information extraction, and question answering.

    What is coreference resolution in NLP?

    Coreference resolution is the task of identifying and linking all expressions within a text that refer to the same real-world entity, such as a person, object, or concept. In practical terms, the coreference resolution NLP technique involves analyzing a text and identifying all expressions that refer to a specific entity, such as “he,” “she,” “it,” or “they.” Once these expressions are identified, they are linked together to form a “coreference chain,” which represents all the different ways in which that entity is referred to in the text.

    For example, given the sentence “John went to the store. He bought some groceries,” a coreference resolution model would identify that “John” and “He” refer to the same entity and produce a cluster of coreferent mentions.
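
    Conceptually, each cluster maps an entity to all of its mentions. As a plain-Python illustration of that idea (this dictionary is illustrative only, not Spark NLP’s actual output format):

    # Illustrative only: one coreference chain from the example sentence
    coref_chain = {"John": ["John", "He"]}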

    Coreference resolution is a complex task, and it is used in a variety of applications, including information extraction, question answering, and machine translation. It is an important task in natural language processing (NLP), as it enables machines to accurately understand the meaning of a text and generate more human-like responses.

    In this post, you will learn how to use Spark NLP to perform coreference resolution.

    Let us start with a short Spark NLP introduction and then discuss the details of the coreference resolution techniques with some solid results.

    Introduction to Spark NLP

    Spark NLP is an open-source library maintained by John Snow Labs (JSL). It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.

    Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:

    • A single unified solution for all your NLP needs
    • Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
    • The most widely used NLP library in the industry (5 years in a row)
    • The most scalable, accurate, and fastest library in NLP history

    Spark NLP comes with 14,500+ pretrained pipelines and models in 250+ languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.

    Spark NLP processes the data using Pipelines, a structure that contains all the steps to be run on the input data:

    Spark NLP pipelines

    Each step contains an annotator that performs a specific task, such as tokenization, normalization, or dependency parsing. Each annotator consumes one or more input annotations and outputs a new annotation.

    An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document. In contrast, a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.
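
    As a minimal sketch of this pattern (using only the open-source components that also appear later in this post), a two-stage pipeline that turns raw text into document and token annotations looks like this:

    from sparknlp.base import DocumentAssembler, Pipeline
    from sparknlp.annotator import Tokenizer

    # Turns raw text into `document` annotations
    document = DocumentAssembler() \
                .setInputCol("text") \
                .setOutputCol("document")

    # Consumes `document` annotations and outputs `token` annotations
    tokenizer = Tokenizer() \
                .setInputCols(["document"]) \
                .setOutputCol("token")

    pipeline = Pipeline(stages=[document, tokenizer])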

    Setup

    To install Spark NLP and perform coreference resolution in Python, simply use your favorite package manager (conda, pip, etc.). For example:

    pip install spark-nlp
    pip install pyspark

    For other installation options for different environments and machines, please check the official documentation.

    Then, import the library and start a Spark session:

    import sparknlp
    
    # Start Spark Session
    spark = sparknlp.start()
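
    To verify that the session started correctly, you can print the library and Spark versions:

    print(sparknlp.version())  # Spark NLP version
    print(spark.version)       # Apache Spark version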

    Defining the Spark NLP Pipeline

    The SpanBertCoref annotator expects DOCUMENT and TOKEN annotations as input and provides DEPENDENCY annotations as output. Thus, we need the previous steps to generate the annotations that will be used as input to our annotator.

    Spark NLP follows the pipeline approach, so our pipeline will include all the necessary stages.

    Please check the Unraveling Coreference Resolution in NLP post for the examples and explanations below.

    The first example is for this text:

    “Ana is a Graduate Student at UT Dallas. She loves working in Natural Language Processing at the Institute. Her hobbies include blogging, dancing and singing.”

    Here, “Ana”, “Natural Language Processing” and “UT Dallas” are possible entities.

    “She” and “Her” are references to the entity “Ana”, and “the Institute” is a reference to the entity “UT Dallas”.

    # Import the required modules and classes
    from sparknlp.base import DocumentAssembler, Pipeline
    from sparknlp.annotator import (
        SentenceDetector,
        Tokenizer,
        SpanBertCorefModel
    )
    import pyspark.sql.functions as F
    
    # Step 1: Transforms raw texts to `document` annotation
    document = DocumentAssembler() \
                .setInputCol("text") \
                .setOutputCol("document")
    
    # Step 2: Sentence Detection
    sentenceDetector = SentenceDetector() \
                .setInputCols("document") \
                .setOutputCol("sentences")
    
    # Step 3: Tokenization
    token = Tokenizer() \
                .setInputCols("sentences") \
                .setOutputCol("tokens") \
                .setContextChars(["(", ")", "?", "!", ".", ","])
    
    # Step 4: Coreference Resolution
    corefResolution = SpanBertCorefModel.pretrained("spanbert_base_coref") \
                .setInputCols(["sentences", "tokens"]) \
                .setOutputCol("corefs") \
                .setCaseSensitive(False)
                
    # Define the pipeline
    pipeline = Pipeline(stages=[document, sentenceDetector, token, corefResolution])
    
    # Create the dataframe
    data = spark.createDataFrame([["Ana is a Graduate Student at UT Dallas. She loves working in Natural Language Processing at the Institute. Her hobbies include blogging, dancing and singing."]]).toDF("text")
    
    # Fit the dataframe to the pipeline to get the model
    model = pipeline.fit(data)

    Let us transform the data to get predictions and determine the related entities:

    model.transform(data) \
        .selectExpr("explode(corefs) AS coref") \
        .selectExpr("coref.result AS token", "coref.metadata") \
        .show(truncate=False)

    The data frame shows the extracted entities and their metadata
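
    Each mention’s metadata includes a head field naming the first mention of its coreference cluster (ROOT when the mention itself starts the chain). A minimal sketch, assuming that metadata layout, that pulls mention-to-head pairs:

    # Explode the coref annotations and pair each mention with its cluster head
    pairs = model.transform(data) \
        .selectExpr("explode(corefs) AS coref") \
        .selectExpr("coref.result AS mention", "coref.metadata['head'] AS head")

    pairs.show(truncate=False)  # e.g., "She" and "Her" should point to "Ana"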

    One-liner alternative

    In October 2022, John Snow Labs released the open-source johnsnowlabs library, which contains all of the company’s products, open-source and licensed, under one common library. This simplified the workflow, especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). This new library is a wrapper around all of John Snow Labs’ libraries and can be installed with pip:

    pip install johnsnowlabs

    Please check the official documentation for more examples and usage of this library. To run coreference resolution with one line of code, we can simply:

    # Import the NLP module which contains Spark NLP and NLU libraries
    from johnsnowlabs import nlp
    
    sample_text= "Ana is a Graduate Student at UT Dallas. She loves working in 
    Natural Language Processing at the Institute. Her hobbies include blogging, 
    dancing and singing."
    
    # Returns a pandas Data Frame, we select the desired columns
    nlp.load('en.coreference.spanbert').predict(sample_text, output_level='sentence')

    The resulting data frame produced by the one-liner model

    The difference between the one-liner’s results and the previous results is that here the model’s case sensitivity was on, so it did not detect “the Institute.”
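
    If you want the custom pipeline to match that behavior, keep case sensitivity on when loading the model; a minimal sketch:

    # Keep case-sensitive matching instead of setCaseSensitive(False)
    corefResolution = SpanBertCorefModel.pretrained("spanbert_base_coref") \
                .setInputCols(["sentences", "tokens"]) \
                .setOutputCol("corefs") \
                .setCaseSensitive(True)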

    The one-liner is based on default models for each NLP task. Depending on your requirements, you may want to use the one-liner for simplicity or customize the pipeline to choose specific models that fit your needs.

    NOTE: When using the johnsnowlabs library, make sure you initialize the Spark session with the configuration you have available. Since some of the libraries are licensed, you may need to set the path to your license file. If you are using only the open-source libraries, you can start the session with spark = nlp.start(nlp=False). The default parameters of the start function load the licensed Healthcare NLP library (nlp=True), but we can set it to False and use all the resources of the open-source libraries such as Spark NLP, Spark NLP Display, and NLU.
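
    For example, a minimal sketch of starting an open-source-only session:

    from johnsnowlabs import nlp

    # Start the Spark session without the licensed Healthcare NLP library
    spark = nlp.start(nlp=False)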

    The second example is much longer and more complicated.

    The paragraph involves a person and a company’s names mentioned in multiple ways, and the model was able to detect them all.


    We will use the same pipeline but feed it the text above:

    data_2 = spark.createDataFrame([[""" "I had no idea I was getting in so deep," says Mr. Kaye, who founded Justin in 1982. Mr. Kaye had sold Capetronic Inc., a Taiwan electronics maker, and retired, only to find he was bored. With Justin, he began selling toys and electronics made mostly in Hong Kong, beginning with Mickey Mouse radios. The company has grown - to about 40 employees, from four initially, Mr Kaye says. Justin has been profitable since 1986."""]]).toDF("text")
    
    model = pipeline.fit(data_2)
    
    model.transform(data_2).selectExpr("explode(corefs) AS coref").selectExpr("coref.result as token", "coref.metadata").show(truncate=False)

    The data frame shows the extracted entities and their metadata


    Conclusion

    The SpanBertCoref annotator of Spark NLP is a coreference resolution model based on SpanBERT, which identifies expressions that refer to the same entity in a text.

    Coreference resolution NLP models produce a mapping of all the expressions in a text that refer to the same real-world entity. Coreference resolution can be challenging, particularly in cases where there are multiple potential referents for a given expression or when the referent is implicit or ambiguous.
