Spark NLP for Healthcare 3.1 improves the accuracy, functionality, and ease of use of the library’s data de-identification capabilities, whose are crutial for natural language processing in healthcare. All improvements come directly from customer feedback, as the library is being used in real-world projects to anonymize millions of medical notes, clinical trial documents, scanned PDF reports & DICOM images. Highlights include:
- New Deidentification Named Entity Recognition Models
- New column returned in DeidentificationModel
- New Re-identification feature
- Extended regex dictionary fuctionality in de-identification
- Chunk filtering based on confidence
- New de-identification pretrained pipelines
Accuracy: New Deidentification Named Entity Recognition (NER) Models
Four new NER models have been trained to identity PHI (protected health information) data that may need to be deidentified. ner_deid_generic_augmented and ner_deid_subentity_augmented models are trained with a combination of the 2014 i2b2 Deid dataset and in-house annotations as well as an augmented version of them. Compared to the same test set coming from the 2014 i2b2 Deid dataset, we achieved better accuracy and generalization on several entity labels as summarized in the following tables. We also trained the same models with glove_100d embeddings to provide more memory-friendly versions
- ner_deid_generic_augmented: Detects PHI 7 entities
(DATE,NAME,LOCATION,PROFESSION,CONTACT,AGE,ID).
Models Hub Page:
https://nlp.johnsnowlabs.com/2021/06/01/ner_deid_generic_augmented_en.html
entity | ner_deid_large (v3.0.3 and before) | ner_deid_generic_augmented (v3.1.0) |
CONTACT |
0.8695 |
0.9592 |
NAME |
0.9452 |
0.9648 |
DATE |
0.9778 |
0.9855 |
LOCATION | 0.8755 |
0.923 |
(MEDICALRECORD,ORGANIZATION,DOCTOR,USERNAME,PROFESSION,HEALTHPLAN,URL,CITY,DATE,LOCATION-OTHER,STATE,PATIENT,DEVICE,COUNTRY,ZIP,PHONE,HOSPITAL,EMAIL,IDNUM,SREET,BIOID,FAX,AGE)
Models Hub Page:
https://nlp.johnsnowlabs.com/2021/09/03/ner_deid_subentity_augmented_en.html
entity |
ner_deid_enriched (v3.0.3 and before) | ner_deid_subentity_augmented (v3.1.0) |
HOSPITAL | 0.8519 |
0.8983 |
DATE |
0.9766 |
0.9854 |
CITY |
0.7493 |
0.8075 |
STREET |
0.8902 |
0.9772 |
ZIP | 0.8 |
0.9504 |
PHONE | 0.8615 |
0.9502 |
DOCTOR | 0.9191 |
0.9347 |
AGE | 0.9416 |
0.9469 |
- ner_deid_generic_glove: Small version ofner_deid_generic_augmentedand detects 7 entities.
- ner_deid_subentity_glove: Small version ofner_deid_subentity_augmentedand detects 23 entities.
Example:
Python
deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \ .setInputCols(["sentence", "token", "embeddings"]) \ .setOutputCol("ner") ... nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, ner_converter]) model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227."""]})))
Results:
Functionality: New column returned in DeidentificationModel
DeidentificationModel now can return a new column to save the mappings between the mask/obfuscated entities and original entities. This column is optional and you can set it up with the.setReturnEntityMappings(True)method. The default value is False. Also, the name for the column can be changed using the following method;.setMappingsColumn(“newAlternativeName”)The new column will produce annotations with the following structure,
Annotation( type: chunk, begin: 17, end: 25, result: 47, metadata:{ originalChunk - 01/13/93 //Original text of the chunk chunk - 0 // The number of the chunk in the sentence beginOriginalChunk - 95 // Start index of the original chunk endOriginalChunk - 102 // End index of the original chunk entity - AGE // Entity of the chunk sentence - 2 // Number of the sentence } )
Functionality: New Re-identification feature
With the new ReidetificationModel, the user can go back to the original sentences using the mappings columns and the deidentification sentences.
Example:
reDeidentification =ReIdentification() .setInputCols(["mappings","deid_chunks"]) .setOutputCol("original")
Functionality: Filtering Entities Based on Confidence
We added a new annotator ChunkFiltererApproach that allows loading a CSV file with both entities and confidence thresholds. This annotator will produce a ChunkFilterer model.
This annotator can be used to filter named entity for de-identification – but also any other type of recognized named entity, as the example below shows.
You can load the dictionary with the following propertysetEntitiesConfidenceResource().
An example dictionary is:
TREATMENT,0.7
With that dictionary, the user can filter the chunks corresponding to treatment entities which have confidence lower than 0.7.
Example:
We have a ner_chunk column and sentence column with the following data:
Ner_chunk
|[{chunk, 141, 163, the genomicorganization, {entity - TREATMENT, sentence - 0, chunk - 0, confidence - 0.57785}, []}, {chunk, 209, 267, a candidate gene forType II diabetes mellitus, {entity - PROBLEM, sentence - 0, chunk - 1, confidence - 0.6614286}, []}, {chunk, 394, 408, byapproximately, {entity - TREATMENT, sentence - 1, chunk - 2, confidence - 0.7705}, []}, {chunk, 478, 508, single nucleotide polymorphisms, {entity - TREATMENT, sentence - 2, chunk - 3, confidence - 0.7204666}, []}, {chunk, 559, 581, aVal366Ala substitution, {entity - TREATMENT, sentence - 2, chunk - 4, confidence - 0.61505}, []}, {chunk, 588, 601, an 8 base-pair, {entity - TREATMENT, sentence - 2, chunk - 5, confidence - 0.29226667}, []}, {chunk, 608, 625, insertion/deletion, {entity - PROBLEM, sentence - 3, chunk - 6, confidence - 0.9841}, []}]| +-------
Sentence
[{document, 0, 298, The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family.Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population., {sentence - 0}, []}, {document, 300, 460, The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons ,separated byapproximately 2.2 and approximately 2.6 kb introns, respectively., {sentence - 1}, []}, {document, 462, 601, We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair, {sentence - 2}, []}, {document, 603, 626, (bp) insertion/deletion., {sentence - 3}, []}]
We can filter the entities using the following annotator:
chunker_filter=ChunkFiltererApproach().setInputCols("sentence", "ner_chunk") \ .setOutputCol("filtered") \ .setCriteria("regex") \ .setRegex([".*"]) \ .setEntitiesConfidenceResource("entities_confidence.csv")
Where entities-confidence.csv has the following data:
TREATMENT,0.7 PROBLEM,0.9
We can use that chunk_filter:
chunker_filter.fit(data).transform(data)
Producing the following entities:
|[{chunk, 394, 408, byapproximately, {entity - TREATMENT, sentence - 1, chunk - 2, confidence - 0.7705}, []}, {chunk, 478, 508, single nucleotide polymorphisms, {entity - TREATMENT, sentence - 2, chunk - 3, confidence - 0.7204666}, []}, {chunk, 608, 625, insertion/deletion, {entity - PROBLEM, sentence - 3, chunk - 6, confidence - 0.9841}, []}]|
As you can see, only the treatment entities with a confidence score of more than 0.7, and the problem entities with a confidence score of more than 0.9 have been kept in the output.
Functionality: Extended Regex Dictionary Context
The RegexPatternsDictionary can now use a regex that spawns the 2 previous token and the 2 next tokens. That feature is implemented using regex groups.
Examples:
Given the sentence The patient with ssn 123123123 we can use the following regex to capture the entittyssn (\d{9}). Given the sentence The patient has 12 yearswe can use the following regex to capture the entitty(\d{2}) years
Ease of Use: New Pretrained De-identification Pipelines
We developed aclinical_deidentificationpretrained pipeline that can be used to de-identify PHI from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR entities.
Models Hub Page: https://nlp.johnsnowlabs.com/2021/05/27/clinical_deidentification_en.html
There is also a lightweight version of the same pipeline trained with memory efficientglove_100dembeddings. Here are the model names:
- clinical_deidentification
- clinical_deidentification_glove
Example:
Python:
from sparknlp.pretrained import PretrainedPipeline deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models") deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.")
Result:
{'sentence': ['Record date : 2093-01-13, David Hale, M.D.', 'IP: 203.120.223.13.', 'The driver's license no:A334455B.', 'the SSN:324598674 and e-mail: hale@gmail.com.', 'Name :Hendrickson, Ora MR. #719435 Date : 01/13/93.', 'PCP : Oliveira, 25 years-old.', 'Record date : 2079-11-09, Patient's VIN :1HGBH41JXMN109286.'], 'masked': ['Record date :<DATE, <DOCTOR, M.D.', 'IP: <IPADDR.', 'The driver's license <DLN.', 'the <SSN and e-mail: <EMAIL.', 'Name : <PATIENT MR. # <MEDICALRECORD Date : <DATE.', 'PCP : <DOCTOR, <AGE years-old.', 'Record date : <DATE, Patient's VIN :<VIN.'], 'obfuscated': ['Record date :2093-01-18, Dr Alveria Eden, M.D.', 'IP: 001.001.001.001.', 'The driver's license K783518004444.', 'the SSN-400-50-8849 and e-mail: Merilynn@hotmail.com.', 'Name : Charls Danger MR. # J3366417 Date : 01-18-1974.', 'PCP : Dr Sina Sewer, 55 years-old.', 'Record date : 2079-11-23, Patient's VIN :6ffff55gggg666777.'], 'ner_chunk': ['2093-01-13', 'David Hale', 'no:A334455B', 'SSN:324598674', 'Hendrickson, Ora', '719435', '01/13/93', 'Oliveira', '25', '2079-11-09', '1HGBH41JXMN109286']}
Get Started
- Live demos & Python notebooks of medical data de-identification tools
- Start a free trial of Spark NLP for Healthcare
- How to build a deidentification pipeline from scratch