We are glad to announce that Spark NLP for Healthcare 3.1.2 has been released! This release comes with new features, new models, bug fixes, and examples.
Highlights
- Support for fine-tuning of NER models.
- More built-in (pre-defined) graphs for MedicalNerApproach.
- Date Normalizer.
- New Relation Extraction Models for ADE.
- Bug Fixes.
- Support for user-defined Custom Transformer.
- Java Workshop Examples.
- Deprecated Compatibility class in Python.
Support for Fine-Tuning of NER Models
Users can now resume training of (fine-tune) existing (already trained) Spark NLP MedicalNer models on new data. Simply provide the path to an existing MedicalNer model and train it further on the new dataset:
ner_tagger = MedicalNerApproach().setPretrainedModelPath("/path/to/trained/medicalnermodel")
If the new dataset contains new tags/labels/entities, users can choose to override existing tags with the new ones. The default behavior is to reset the list of existing tags and generate a new list from the new dataset. It is also possible to preserve the existing tags by setting the ‘overrideExistingTags’ parameter:
ner_tagger = MedicalNerApproach()\
    .setPretrainedModelPath("/path/to/trained/medicalnermodel")\
    .setOverrideExistingTags(False)
Setting overrideExistingTags to false is intended for resuming training on the same or a very similar dataset (i.e., with the same tags or with just a few different ones).
If tag overriding is disabled and new tags are found in the training set, the approach will try to allocate them to unused output nodes, if any are available. It is also possible to override specific tags of the old model by mapping them to new tags:
ner_tagger = MedicalNerApproach()\
    .setPretrainedModelPath("/path/to/trained/medicalnermodel")\
    .setOverrideExistingTags(False)\
    .setTagsMapping("B-PER,B-VIP", "I-PER,I-VIP")
In this case, the new tags ‘B-VIP’ and ‘I-VIP’ will replace the already trained tags ‘B-PER’ and ‘I-PER’. Unmapped old tags will remain in use, and unmapped new tags will be allocated to unused output nodes, if any are available.
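For reference, a minimal end-to-end fine-tuning sketch is shown below. The CoNLL dataset path, embeddings model, column names, and training hyperparameters are illustrative assumptions rather than required values.

# A minimal fine-tuning sketch; paths and hyperparameters are illustrative assumptions.
from pyspark.ml import Pipeline
from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerApproach

# Read the new labeled dataset (CoNLL format assumed here)
training_data = CoNLL().readDataset(spark, "/path/to/new_dataset.conll")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Resume training from the existing model instead of starting from scratch
ner_tagger = MedicalNerApproach() \
    .setPretrainedModelPath("/path/to/trained/medicalnermodel") \
    .setOverrideExistingTags(False) \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setBatchSize(8)

finetuned_model = Pipeline(stages=[embeddings, ner_tagger]).fit(training_data)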
Jupyter Notebook:
More built-in graphs for MedicalNerApproach
Seventy new TensorFlow graphs have been added to the library of available graphs which are used to train MedicalNer models. The graph with the optimal set of parameters is automatically chosen by MedicalNerApproach during training.
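No extra configuration is needed to benefit from this: the graph selection happens automatically. If a custom graph is still preferred, MedicalNerApproach can be pointed at a user-provided graph folder via setGraphFolder; the folder path in the sketch below is an illustrative assumption.

# Optional: use a custom graph folder instead of the built-in graphs.
# The folder path below is an illustrative assumption.
ner_tagger = MedicalNerApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setGraphFolder("/path/to/custom/tf_graphs")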
DateNormalizer
DateNormalizer is a new annotator that normalizes dates to the format YYYY/MM/DD. It identifies dates in chunk annotations and transforms them to that format. Both the input and output annotation types of the annotator are chunk.
Example
Python
sentences = [
    ['08/02/2018'],
    ['11/2018'],
    ['11/01/2018'],
    ['12Mar2021'],
    ['Jan 30, 2018'],
    ['13.04.1999'],
    ['3April 2020'],
    ['next monday'],
    ['today'],
    ['next week'],
]
df = spark.createDataFrame(sentences).toDF("text")

document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document')
chunksDF = document_assembler.transform(df)

aa = map_annotations_col(chunksDF.select("document"),
                         lambda x: [Annotation('chunk', a.begin, a.end, a.result, a.metadata, a.embeddings) for a in x],
                         "document", "chunk_date", "chunk")

dateNormalizer = DateNormalizer()\
    .setInputCols('chunk_date')\
    .setOutputCol('date')\
    .setAnchorDateYear(2021)\
    .setAnchorDateMonth(2)\
    .setAnchorDateDay(27)

dateDf = dateNormalizer.transform(aa)
dateDf.select("date.result", "text").show()
+------------+----------+
|text        |date      |
+------------+----------+
|08/02/2018  |2018/08/02|
|11/2018     |2018/11/DD|
|11/01/2018  |2018/11/01|
|12Mar2021   |2021/03/12|
|Jan 30, 2018|2018/01/30|
|13.04.1999  |1999/04/13|
|3April 2020 |2020/04/03|
|next Monday |2021/06/19|
|today       |2021/06/12|
|next week   |2021/06/19|
+------------+----------+
New Relation Extraction Models for ADE
We are releasing new Relation Extraction models for ADE (Adverse Drug Event). The new models are available as instances of the RelationExtraction and BERT-based RelationExtractionDL annotators, and are capable of linking drugs with ADE mentions.
Example
Python
ade_re_model = RelationExtractionModel.pretrained('re_ade_clinical', 'en', 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"])\
    .setOutputCol("relations")\
    .setPredictionThreshold(0.5)\
    .setRelationPairs(['ade-drug', 'drug-ade'])

pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder,
                            ner_tagger, ner_converter, dependency_parser, re_ner_chunk_filter,
                            ade_re_model])

text = """A 30 year old female presented with tense bullae due to excessive use of naproxin, and leg cramps relating to oxaprozin."""

data = spark.createDataFrame([[text]]).toDF("text")
p_model = pipeline.fit(data)
result = p_model.transform(data)
Results
|    | chunk1       | entity1 | chunk2     | entity2 | result |
|---:|:-------------|:--------|:-----------|:--------|-------:|
|  0 | tense bullae | ADE     | naproxin   | DRUG    |      1 |
|  1 | tense bullae | ADE     | oxaprozin  | DRUG    |      0 |
|  2 | naproxin     | DRUG    | leg cramps | ADE     |      0 |
|  3 | leg cramps   | ADE     | oxaprozin  | DRUG    |      1 |
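For the BioBERT-based model (redl_ade_biobert, benchmarked below), the usual pattern is to pair candidate NER chunks with a RENerChunksFilter and feed them to RelationExtractionDLModel. The following is a minimal sketch, assuming the upstream stages and the "ner_chunk", "dependencies", and "sentences" columns from the example above.

# A minimal sketch for the BioBERT-based model; the upstream stages (documenter,
# sentencer, tokenizer, pos_tagger, words_embedder, ner_tagger, ner_converter,
# dependency_parser) and their output column names are assumed from the example above.
ade_re_ner_chunk_filter = RENerChunksFilter() \
    .setInputCols(["ner_chunk", "dependencies"]) \
    .setOutputCol("re_ner_chunks") \
    .setRelationPairs(["ade-drug", "drug-ade"])

ade_re_dl_model = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setOutputCol("relations") \
    .setPredictionThreshold(0.5)

dl_pipeline = Pipeline(stages=[documenter, sentencer, tokenizer, pos_tagger, words_embedder,
                               ner_tagger, ner_converter, dependency_parser,
                               ade_re_ner_chunk_filter, ade_re_dl_model])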
Benchmarking
Model: re_ade_clinical
              precision    recall  f1-score   support
           0       0.85      0.89      0.87      1670
           1       0.88      0.84      0.86      1673
   micro avg       0.87      0.87      0.87      3343
   macro avg       0.87      0.87      0.87      3343
weighted avg       0.87      0.87      0.87      3343
Model: redl_ade_biobert
Relation        Recall   Precision   F1      Support
0               0.894    0.946       0.919   1011
1               0.963    0.926       0.944   1389
Avg.            0.928    0.936       0.932
Weighted Avg.   0.934    0.934       0.933
Support for user-defined Custom Transformer
Utility classes to define custom transformers in Python are included in this release. This allows users to define functions in Python that manipulate Spark NLP annotations. These new transformers can be added to pipelines like any of the other models you’re already familiar with.
Example of how to use the custom transformer.
def myFunction(annotations):
    # lower case the content of the annotations
    return [a.copy(a.result.lower()) for a in annotations]

custom_transformer = CustomTransformer(f=myFunction).setInputCol("ner_chunk").setOutputCol("custom")
outputDf = custom_transformer.transform(outdf).select("custom").toPandas()
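Since these transformers behave like regular pipeline stages, they can also be placed directly in a Pipeline. The sketch below illustrates this; the upstream stages producing the "ner_chunk" column are assumptions and would need to be defined elsewhere.

# Illustrative sketch: the custom transformer used as a regular pipeline stage.
# The upstream stages producing "ner_chunk" are assumed to be defined elsewhere.
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer,
                            word_embeddings, ner_model, ner_converter,
                            custom_transformer])
output_df = pipeline.fit(df).transform(df).select("custom")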
Java Workshop Examples
New Java examples were added to the workshop repository. https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/java/healthcare
Deprecated Compatibility class in Python
Due to our active release cycle, we add and train new pretrained models with each release, and it can be tricky to maintain backward compatibility or keep up with the latest models and versions, especially for users running our models locally in air-gapped networks.
We are releasing a new utility class to help you check whether your local, existing models are up to date with the latest versions we have released. You no longer need to specify your AWS credentials. This new class replaces the previous Compatibility class written in Python and the CompatibilityBeta class written in Scala.
from sparknlp_jsl.compatibility import Compatibility

compatibility = Compatibility(spark)
print(compatibility.findVersion('sentence_detector_dl_healthcare'))
Output
[{'name': 'sentence_detector_dl_healthcare', 'sparkVersion': '2.4', 'version': '2.6.0', 'language': 'en', 'date': '2020-09-13T14:44:42.565', 'readyToUse': 'true'},
 {'name': 'sentence_detector_dl_healthcare', 'sparkVersion': '2.4', 'version': '2.7.0', 'language': 'en', 'date': '2021-03-16T08:42:34.391', 'readyToUse': 'true'}]
Installing and Quickstart Instructions:
python3 -m pip install --upgrade spark-nlp-jsl==3.1.2 --user --extra-index-url https://pypi.johnsnowlabs.com/3.1.2-8a8d35167f9758a982fcb9d0a831267887132710
Online Quickstart (requires Internet):
Either use our quick SparkSession starter:
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.common import *
import sparknlp_jsl

spark = sparknlp_jsl.start("3.1.2-8a8d35167f9758a982fcb9d0a831267887132710")
Or create SparkSession manually:
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.common import *
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('Spark NLP') \
    .config("spark.driver.memory", "32G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2") \
    .config("spark.jars", "https://pypi.johnsnowlabs.com/3.1.2-8a8d35167f9758a982fcb9d0a831267887132710/spark-nlp-jsl-3.1.2.jar") \
    .getOrCreate()