Detect Radiology Related Entities with Spark NLP

17.03.2021

David Cecchini

Data Scientist at John Snow Labs

In this post, we will explain how to use the new Spark NLP‘s pre-trained NER NLP model ner_radiology that can identify entities related to radiology.

Introduction

Named Entity Recognition (NER) is a dynamic field of Natural Language Processing (NLP) research and is defined as an automatic system to detect “entities” in a text. These entities depend on the application scenario and can be People, Location, and Organization for a generic news article text, or, as in the case here, entities related to radiology.

The capability to automatically identify radiology entities can help health practitioners to structure and organize big data sets at scale. Usually, without an automatic model, practitioners need to manually label the entities in the text if they want to analyze the data they have or create a model that uses this information as input.

John Snow Labs’ Spark NLP

Spark NLP is an open-source Java and Python NLP library built on top of Apache Spark from John Snow Labs. It is one of the top growing NLP production-ready solutions with support for all main tasks related to NLP in more than 46 languages.

Figure 1: Open-Source Spark NLP

Apart from the open-source library, the company maintains the licensed library specialized for healthcare solutions, SPARK NLP for Healthcare.

Figure 2: Spark NLP for Healthcare

Spark NLP for Healthcare contains a pre-trained model for Named Entity Recognition in the scope of radiology.

Spark NLP pipeline

Spark NLP is built on top of Apache Spark and is based on the concept of pipelines, where we can perform several NLP tasks in our data with a unified pipeline. For an introduction of Spark NLP functionalities, refer to [1].

Here, we will explore the Spark NLP for Healthcare capabilities for Radiology NER using the new model.

The new ner_radiology model was trained on a custom dataset comprising of MIMIC-CXR [5] and MT Radiology texts, and can identify the following entities:

ImagingTest: Name of an imaging test (Ultrasound)
Imaging_Technique: Name of the technique used
ImagingFindings: Diagnostic or identifications found on tests (Ovoid mass)
OtherFindings: Other findings
BodyPart: Name of anatomic part of the body (Shoulder)
Direction: Detail direction of the image test (Bilateral
Test: Name of the test (not imaging)
Symptom: Symptom identified
Disease_Syndrome_Disorder: Name of health condition (Lipoma)
Medical_Device: Name of medical devices
Procedure: Name of procedure
Measurements: Measurements such as 0.5 x 0.4
Units: Units such as cm, ml

In Spark NLP, we build a pipeline containing all the steps (stage) of the model, and in the case of NER models, we need to transform the text into a document annotator (DocumentAssembler), identify sentences (SentenceDetector), tokenize (Tokenizer), transform to embeddings vectors (WordEmbeddingsModel), identify entities (NerModel) and finally convert the entities to standard formats (NerConverter). The full pipeline in Python can be seen below.

# Import modules from pyspark.ml.pipeline import PipelineModel from sparknlp.base import DocumentAssembler from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel from sparknlp.annotator import NerDLModel, NerConverter document_assembler = DocumentAssembler() \         .setInputCol("text") \         .setOutputCol("document") sentence_detector = SentenceDetector()\         .setInputCols(["document"])\         .setOutputCol("sentence") tokenizer = Tokenizer()\         .setInputCols(["sentence"])\         .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\         .setInputCols("document", "token") \         .setOutputCol("embeddings") ner =  NerDLModel.pretrained("ner_radiology", "en", "clinical/models") \         .setInputCols(["document", "token", "embeddings"]) \         .setOutputCol("ner") ner_converter = NerConverter() \         .setInputCols(["document", "token", "ner"]) \         .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[         document_assembler,         sentence_detector,         tokenizer,         embeddings,         ner,         ner_converter])

We then can initialize the pipeline on an example:

Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.

To do that, we will use two pipeline objects: the Pipeline and the LightPipeline. The first one is useful to keep all the stages in one single object and is used on spark data frames, while the second one is useful for fast predictions and can be used directly on strings (or list of strings).

# Send example to spark data frame
        example = spark.createDataFrame([['''Bilateral breast ultrasound was subsequently performed, which 
        demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the 
        anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the 
        adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue 
        or a lipoma.''']]).toDF("text")
        
        # Initialize the model
        model = nlpPipeline.fit(example)
        
        # Make predictions on the example (full model)
        result = model.transform(example)
        
        # Create a light model for fast predictions
        lmodel = LightPipeline(model)
        
        # Use texts directly
        result_light = lmodel.fullAnnotate("Bilateral breast ultrasound was subsequently performed, which 
        demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within 
        the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to 
        the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous 
        tissue or a lipoma."

To visualize the results of the light model, we can run:

for res in result_light[0]['ner_chunk']:
        print(f"Chunk: {res.result}, Entity: {res.metadata['entity']}")

Resulting:

Chunk: Bilateral breast, Entity: BodyPart Chunk: ultrasound, Entity: ImagingTest Chunk: ovoid mass, Entity: ImagingFindings Chunk: 0.5 x 0.5 x 0.4, Entity: Measurements Chunk: cm, Entity: Units Chunk: anteromedial aspect of the left shoulder, Entity: BodyPart Chunk: mass, Entity: ImagingFindings Chunk: isoechoic echotexture, Entity: ImagingFindings Chunk: muscle, Entity: BodyPart Chunk: internal color flow, Entity: ImagingFindings Chunk: benign fibrous tissue, Entity: ImagingFindings Chunk: lipoma, Entity: Disease_Syndrome_Disorder

And for the full model:

result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']['entity']").alias("entities"))\
        .show(truncate=False)

Resulting:

chunk	entities
Bilateral	Direction
breast	BodyPart
ultrasound	ImagingTest
ovoid mass	ImagingFindings
0.5 x 0.5 x 0.4	Measurements
cm	Units
anteromedial aspect	Direction
left	Direction
shoulder	BodyPart
mass	ImagingFindings
isoechoic echotexture	ImagingFindings
muscle	BodyPart
internal color flow	ImagingFindings
benign fibrous tissue	ImagingFindings
lipoma	Disease_Syndrome_Disorder

Overall, the model achieves good performance on the entities (except for the OtherFindings that is less relevant), as can be seen in the table below.

entity	tp	fp	fn	total	precision	recall	f1
OtherFindings	8	15	63	71	0.3478	0.1127	0.1702
Measurements	481	30	15	496	0.9413	0.9698	0.9553
Direction	650	137	94	744	0.8259	0.8737	0.8491
ImagingFindings	1,345	355	324	1,669	0.7912	0.8059	0.7985
BodyPart	1,942	335	290	2,232	0.8529	0.8701	0.8614
Medical_Device	236	75	64	300	0.7588	0.7867	0.7725
Test	222	41	48	270	0.8441	0.8222	0.8330
Procedure	269	117	116	385	0.6969	0.6987	0.6978
ImagingTest	263	50	43	306	0.8403	0.8595	0.8498
Symptom	498	101	132	630	0.8314	0.7905	0.8104
Disease_Syndrome_Disorder	1,180	258	200	1,380	0.8206	0.8551	0.8375
Units	269	10	2	271	0.9642	0.9926	0.9782
Imaging_Technique	140	38	25	165	0.7865	0.8485	0.8163

The model achieves and micro accuracy of 0.8315 and a macro accuracy of 0.7524.

Conclusion

In this post, we introduced a new clinical Named Entity Resolution model that identifies entities related to radiology. If you want to try them on your data, you can ask for a Spark NLP Healthcare free trial license.

Being used in enterprise projects and built natively on Apache Spark and TensorFlow as well as offering all-in-one state-of-the-art NLP solutions, Spark NLP library provides simple, performant as well as accurate NLP notations for machine learning pipelines that can scale easily in a distributed environment.

If you want to find out more and start practicing Spark NLP, please check out the reference resources below.

Detect Radiology Related Entities with Spark NLP

Introduction

John Snow Labs’ Spark NLP

Spark NLP pipeline

We then can initialize the pipeline on an example:

To visualize the results of the light model, we can run:

Resulting:

And for the full model:

Resulting:

Conclusion

Further reading

White Swan's Impressions of John Snow Labs' Clinical Named Entity Recognition

Join the Global Healthcare AI Community

Detect Radiology Related Entities with Spark NLP

Introduction

John Snow Labs’ Spark NLP

Spark NLP pipeline

We then can initialize the pipeline on an example:

To visualize the results of the light model, we can run:

Resulting:

And for the full model:

Resulting:

Conclusion

Further reading

White Swan's Impressions of John Snow Labs' Clinical Named Entity Recognition

Recommended For You