EntityRulerInternal in Spark NLP extracts medical entities from text using regex patterns or exact matches defined in JSON or CSV files. With practical examples, this post explains how to set it up and use it in a Healthcare NLP pipeline.
Named Entity Recognition (NER) is a critical task in Natural Language Processing (NLP), especially in the healthcare domain, where accurately identifying medical entities in unstructured text is essential. Besides the diverse NER models available in our Models Hub, such as the Healthcare NLP MedicalNerModel built on a Bidirectional LSTM-CNN architecture and BertForTokenClassification, our library also includes powerful rule-based annotators such as ContextualParser, TextMatcher, RegexMatcher, and EntityRulerInternal. While deep learning models excel at capturing complex patterns, rule-based approaches offer precision for well-defined ones. Spark NLP’s EntityRulerInternal bridges this gap, combining the power of regex and string matching with the scalability of Spark.
EntityRulerInternal Overview
EntityRulerInternal is an annotator in Healthcare NLP that matches exact strings or regex patterns provided in a file against a document and assigns them a named entity. This approach allows for the precise identification of entities using predefined rules, enhancing the accuracy and consistency of entity extraction.
Parameters:
- setPatternsResource (str): Path to the resource file (JSON or CSV) mapping entities to patterns. It also accepts an optional read_as argument (how to interpret the resource, ReadAs.TEXT by default) and an optional options dict for parsing (by default {"format": "JSON"}).
- setSentenceMatch (Boolean): Determines whether to match at the sentence level (True) or token level (False).
- setAlphabetResource (str): Path to a plain text file containing all language characters.
- setUseStorage (Boolean): Enables the use of RocksDB storage for pattern serialization.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
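Before moving to the full pipeline, here is a minimal sketch of how these parameters are typically wired together on the annotator. It assumes the johnsnowlabs-style import used later in this post (from johnsnowlabs import medical), is meant to be run after the setup below, and uses a placeholder patterns.json file; the values are illustrative only.

from johnsnowlabs import medical

# Illustrative configuration only; "patterns.json" is a placeholder file name.
# Run this after starting a Spark session as shown in the Setup section.
entity_ruler = (
    medical.EntityRulerInternalApproach()
    .setInputCols(["document", "token"])
    .setOutputCol("entities")
    # JSON (or CSV) file mapping entities to patterns
    .setPatternsResource("patterns.json")
    # False = match at token level, True = match at sentence level
    .setSentenceMatch(False)
    # set True to serialize the patterns with RocksDB storage
    .setUseStorage(False)
)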
Setup
First, we need to set up the Spark NLP Healthcare library. Follow the detailed instructions provided in the official documentation.
Additionally, refer to the Healthcare NLP GitHub repository for sample notebooks demonstrating setup on Google Colab under the “Colab Setup” section.
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs
from google.colab import files

print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.settings.enforce_versions = True
nlp.install(refresh_install=True)
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars the user has access to
spark = nlp.start()
EntityRulerInternalApproach:
1. Define Patterns:
Create a JSON file containing patterns and their corresponding entities. A CSV file can be used instead; in either case, the file path is passed to setPatternsResource() (a CSV sketch follows the JSON example below).
import json

data = [
    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition", "tonsilitis", "GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever", "headache"]
    },
]

with open("entities.json", "w") as f:
    json.dump(data, f)
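The same mapping can also be written as a CSV file. The sketch below follows the label/pattern-per-line layout used by Spark NLP's entity ruler resources; the file name, delimiter, and options dict are illustrative, so check the EntityRulerInternal documentation for the exact options it accepts.

# Illustrative CSV equivalent of the JSON patterns above,
# one "label|pattern" pair per line (the "|" delimiter is an assumption).
csv_rows = [
    "Drug|paracetamol",
    "Drug|aspirin",
    "Drug|ibuprofen",
    "Drug|lansoprazol",
    "Disease|heart condition",
    "Disease|tonsilitis",
    "Disease|GORD",
    "Symptom|fever",
    "Symptom|headache",
]

with open("entities.csv", "w") as f:
    f.write("\n".join(csv_rows))

# The resource would then be declared as CSV when configuring the annotator, e.g.:
# .setPatternsResource("entities.csv", options={"format": "csv", "delimiter": "|"})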
2. Load Patterns:
Load the patterns into the EntityRulerInternal annotator.
entity_ruler = medical.EntityRulerInternalApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setPatternsResource("entities.json") \
    .setCaseSensitive(False)
3. Build the Pipeline:
Define and run a pipeline including the EntityRulerInternal annotator.
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# A tokenizer is required because the entity ruler reads the "token" column
tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    entity_ruler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

result = pipeline.fit(data).transform(data)
4. View Results:
Extract and display the results.
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(
          result.entities.result,
          result.entities.begin,
          result.entities.end,
          result.entities.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("label")).show(truncate=30)
Output:
+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+
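For quick checks on a single string, the fitted pipeline can also be wrapped in a LightPipeline instead of going through a Spark DataFrame. The snippet below is a minimal sketch using the same pipeline and entity labels as above.

# Optional: annotate raw strings directly with a LightPipeline.
light_model = nlp.LightPipeline(pipeline.fit(data))

light_result = light_model.fullAnnotate(
    "John's doctor prescribed aspirin for his heart condition "
    "and paracetamol for his fever."
)[0]

# Each chunk carries its text, character offsets, and the assigned entity label.
for chunk in light_result["entities"]:
    print(chunk.result, chunk.begin, chunk.end, chunk.metadata.get("entity"))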
Regex Patterns:
As shown in the example below, we can also define regex patterns to detect entities. Here we extract the Date entity using regex patterns defined in the JSON file.
import json

data = [
    {
        "id": "date-regex",
        "label": "Date",
        "patterns": ["\\d{4}-\\d{2}-\\d{2}", "\\d{4}"],
        "regex": True
    },
    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition", "tonsilitis", "GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever", "headache"]
    },
]

with open("entities.json", "w") as f:
    json.dump(data, f)
entity_ruler = medical.EntityRulerInternalApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setPatternsResource("entities.json") \
    .setCaseSensitive(False)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    entity_ruler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)
result.select(F.explode(F.arrays_zip(
          result.entities.result,
          result.entities.begin,
          result.entities.end,
          result.entities.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("label")).show(truncate=30)
Output:
+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|     2023-12-01|  206|215|   Date|
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+
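To see what the date expressions actually capture, you can test them outside the pipeline with Python's re module. This is only an illustration of the patterns themselves; the annotator evaluates them on the JVM, so edge-case behavior may differ slightly.

import re

sample = "lansoprazole for his GORD on 2023-12-01."

# The same expressions declared under the "date-regex" entry in entities.json.
for pattern in [r"\d{4}-\d{2}-\d{2}", r"\d{4}"]:
    print(pattern, "->", re.findall(pattern, sample))
# \d{4}-\d{2}-\d{2} -> ['2023-12-01']
# \d{4} -> ['2023']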
EntityRulerInternalModel:
This annotator is the instantiated model of EntityRulerInternalApproach. Once you build an EntityRulerInternalApproach, you can save it and reuse it as an EntityRulerInternalModel via the load() function. Let's re-build one of the earlier examples and save it.
data = spark.createDataFrame([["John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01."]]).toDF("text")

data.show(truncate=False)
Output:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Saving the fitted entity ruler stage to disk:
model.stages[-1].write().overwrite().save("ruler_approach_model")
Loading the saved model and using it via load():
entity_ruler = medical.EntityRulerInternalModel.load("/content/ruler_approach_model") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities")

pipeline = nlp.Pipeline(stages=[document_assembler, tokenizer, entity_ruler])

pipeline_model = pipeline.fit(data)
result = pipeline_model.transform(data)
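Beyond saving a single stage, the whole fitted pipeline can be persisted and reloaded with Spark ML's standard PipelineModel API, which is often more convenient for deployment. A minimal sketch, assuming the pipeline_model fitted above; the directory name is illustrative.

from pyspark.ml import PipelineModel

# Persist every fitted stage, including the entity ruler.
pipeline_model.write().overwrite().save("entity_ruler_pipeline")

# Later (or in another application), reload it and reuse it on new data.
loaded_pipeline = PipelineModel.load("entity_ruler_pipeline")
loaded_result = loaded_pipeline.transform(data)
loaded_result.select("entities.result").show(truncate=False)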
Conclusion
EntityRulerInternal in Healthcare NLP offers a practical way to combine rule-based matching with deep-learning models for medical entity extraction. By leveraging predefined patterns and the scalability of Spark, it delivers high precision and efficiency when processing clinical texts. Used alongside NER models in the same pipeline, this hybrid approach pairs the flexibility of deep learning with the determinism of rules, making it adaptable to a wide range of medical domains and use cases.
EntityRulerInternal allows healthcare professionals and researchers to extract relevant medical information from large amounts of unstructured data, improving clinical decision-making and patient care. Its customizable patterns can identify drug names, medical conditions, and procedural terms, tailored to specific needs.
Overall, EntityRulerInternal is a valuable tool in the Spark NLP suite, empowering users to harness the full potential of both rule-based and machine-learning techniques in their NLP workflows. For more detailed examples and advanced usage, refer to the Spark NLP Workshop and explore the capabilities of Spark NLP’s healthcare library.