    Simpler & More Accurate Deidentification in Spark NLP for Healthcare

    Spark NLP for Healthcare 3.1 improves the accuracy, functionality, and ease of use of the library’s data de-identification capabilities, which are crucial for natural language processing in healthcare. All improvements come directly from customer feedback, as the library is used in real-world projects to anonymize millions of medical notes, clinical trial documents, scanned PDF reports, and DICOM images. Highlights include:

    • New Deidentification Named Entity Recognition Models
    • New column returned in DeidentificationModel
    • New Re-identification feature
    • Extended regex dictionary functionality in de-identification
    • Chunk filtering based on confidence
    • New de-identification pretrained pipelines

    Accuracy: New Deidentification Named Entity Recognition (NER) Models

    Four new NER models have been trained to identify PHI (protected health information) that may need to be de-identified. The ner_deid_generic_augmented and ner_deid_subentity_augmented models are trained on a combination of the 2014 i2b2 Deid dataset and in-house annotations, as well as an augmented version of both. On the same test set from the 2014 i2b2 Deid dataset, they achieve better accuracy and generalization on several entity labels than the previous models, as summarized in the following tables. We also trained the same models with glove_100d embeddings to provide more memory-friendly versions.

    • ner_deid_generic_augmented: Detects 7 PHI entities
    (DATE, NAME, LOCATION, PROFESSION, CONTACT, AGE, ID).

    Models Hub Page:

    https://nlp.johnsnowlabs.com/2021/06/01/ner_deid_generic_augmented_en.html

    entity      ner_deid_large (v3.0.3 and before)    ner_deid_generic_augmented (v3.1.0)
    CONTACT     0.8695                                0.9592
    NAME        0.9452                                0.9648
    DATE        0.9778                                0.9855
    LOCATION    0.8755                                0.923

    • ner_deid_subentity_augmented: Detects 23 PHI entities
    (MEDICALRECORD, ORGANIZATION, DOCTOR, USERNAME, PROFESSION, HEALTHPLAN, URL, CITY, DATE, LOCATION-OTHER, STATE, PATIENT, DEVICE, COUNTRY, ZIP, PHONE, HOSPITAL, EMAIL, IDNUM, STREET, BIOID, FAX, AGE).

    Models Hub Page:

    https://nlp.johnsnowlabs.com/2021/09/03/ner_deid_subentity_augmented_en.html

    entity      ner_deid_enriched (v3.0.3 and before)    ner_deid_subentity_augmented (v3.1.0)
    HOSPITAL    0.8519                                   0.8983
    DATE        0.9766                                   0.9854
    CITY        0.7493                                   0.8075
    STREET      0.8902                                   0.9772
    ZIP         0.8                                      0.9504
    PHONE       0.8615                                   0.9502
    DOCTOR      0.9191                                   0.9347
    AGE         0.9416                                   0.9469

    • ner_deid_generic_glove: Small version of ner_deid_generic_augmented; detects 7 entities.
    • ner_deid_subentity_glove: Small version of ner_deid_subentity_augmented; detects 23 entities.

    Example:

    Python

    import pandas as pd
    from pyspark.ml import Pipeline
    from sparknlp_jsl.annotator import MedicalNerModel

    # Load the pretrained de-identification NER model
    deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner")

    ...

    # The other stages are defined in the part elided above
    nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner,
                                   ner_converter])

    # Fit on an empty DataFrame to obtain a PipelineModel
    model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

    results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, 
    David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, 
    Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227."""]})))

    Results:

    Functionality: New column returned in DeidentificationModel

    DeidentificationModel can now return an additional column that stores the mappings between the masked or obfuscated entities and the original entities. This column is optional; enable it with the .setReturnEntityMappings(True) method (the default value is False). The name of the column can be changed with .setMappingsColumn("newAlternativeName"). The new column produces annotations with the following structure:

    Annotation(
      type: chunk,
      begin: 17,
      end: 25,
      result: 47,
      metadata:{
        originalChunk - 01/13/93 // Original text of the chunk
        chunk - 0 // The number of the chunk in the sentence
        beginOriginalChunk - 95 // Start index of the original chunk
        endOriginalChunk - 102 // End index of the original chunk
        entity - AGE // Entity of the chunk
        sentence - 2 // Number of the sentence
      }
    )

    Functionality: New Re-identification feature

    With the new ReIdentification annotator, the user can recover the original sentences from the mappings column and the de-identified sentences.

    Example:

    # Takes the mappings column and the de-identified chunks, returns the original text
    reidentification = ReIdentification() \
        .setInputCols(["mappings", "deid_chunks"]) \
        .setOutputCol("original")
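
Conceptually, re-identification can be pictured as replaying the stored mappings against the de-identified text. The plain-Python sketch below illustrates that idea only; it is not the library's implementation, and the `Mapping` class, its fields, and the `reidentify` and `locate` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    """Hypothetical stand-in for one entry of the mappings column."""
    begin: int      # start index of the placeholder in the de-identified text
    end: int        # end index of the placeholder (inclusive)
    original: str   # original text of the chunk


def reidentify(deid_text: str, mappings: list) -> str:
    """Put each original chunk back in place of its placeholder."""
    restored = deid_text
    # Work from right to left so that earlier offsets stay valid.
    for m in sorted(mappings, key=lambda m: m.begin, reverse=True):
        restored = restored[:m.begin] + m.original + restored[m.end + 1:]
    return restored


def locate(text: str, placeholder: str, original: str) -> Mapping:
    """Build a mapping by finding the placeholder in the text."""
    begin = text.index(placeholder)
    return Mapping(begin, begin + len(placeholder) - 1, original)


deid = "Date : <DATE>, PCP : <DOCTOR>"
mappings = [
    locate(deid, "<DATE>", "01/13/93"),
    locate(deid, "<DOCTOR>", "Oliveira"),
]
print(reidentify(deid, mappings))  # Date : 01/13/93, PCP : Oliveira
```

Replacing from the last offset to the first avoids having to shift the remaining offsets after each substitution.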

    Functionality: Filtering Entities Based on Confidence

    We added a new annotator ChunkFiltererApproach that allows loading a CSV file with both entities and confidence thresholds. This annotator will produce a ChunkFilterer model.

    This annotator can be used to filter named entities for de-identification, but also any other type of recognized named entity, as the example below shows.

    You can load the dictionary with the setEntitiesConfidenceResource() property.

    An example dictionary is:

    TREATMENT,0.7

    With that dictionary, the user can filter out the chunks of TREATMENT entities whose confidence is lower than 0.7.

    Example:

    We have a ner_chunk column and sentence column with the following data:

    ner_chunk

    [{chunk, 141, 163, the genomicorganization, {entity - TREATMENT, sentence - 0, chunk - 0, confidence - 0.57785}, []},
     {chunk, 209, 267, a candidate gene forType II diabetes mellitus, {entity - PROBLEM, sentence - 0, chunk - 1, confidence - 0.6614286}, []},
     {chunk, 394, 408, byapproximately, {entity - TREATMENT, sentence - 1, chunk - 2, confidence - 0.7705}, []},
     {chunk, 478, 508, single nucleotide polymorphisms, {entity - TREATMENT, sentence - 2, chunk - 3, confidence - 0.7204666}, []},
     {chunk, 559, 581, aVal366Ala substitution, {entity - TREATMENT, sentence - 2, chunk - 4, confidence - 0.61505}, []},
     {chunk, 588, 601, an 8 base-pair, {entity - TREATMENT, sentence - 2, chunk - 5, confidence - 0.29226667}, []},
     {chunk, 608, 625, insertion/deletion, {entity - PROBLEM, sentence - 3, chunk - 6, confidence - 0.9841}, []}]

    sentence

    [{document, 0, 298, The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family.Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population., {sentence - 0}, []},
     {document, 300, 460, The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons ,separated byapproximately 2.2 and approximately 2.6 kb introns, respectively., {sentence - 1}, []},
     {document, 462, 601, We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair, {sentence - 2}, []},
     {document, 603, 626, (bp) insertion/deletion., {sentence - 3}, []}]

    We can filter the entities using the following annotator:

    # Keep every chunk that matches the regex and meets its entity's confidence threshold
    chunker_filter = ChunkFiltererApproach() \
        .setInputCols("sentence", "ner_chunk") \
        .setOutputCol("filtered") \
        .setCriteria("regex") \
        .setRegex([".*"]) \
        .setEntitiesConfidenceResource("entities_confidence.csv")

    Where entities_confidence.csv has the following data:

    TREATMENT,0.7
    PROBLEM,0.9

    We can use that chunker_filter:

    chunker_filter.fit(data).transform(data)

    Producing the following entities:

    [{chunk, 394, 408, byapproximately, {entity - TREATMENT, sentence - 1, chunk - 2, confidence - 0.7705}, []},
     {chunk, 478, 508, single nucleotide polymorphisms, {entity - TREATMENT, sentence - 2, chunk - 3, confidence - 0.7204666}, []},
     {chunk, 608, 625, insertion/deletion, {entity - PROBLEM, sentence - 3, chunk - 6, confidence - 0.9841}, []}]

    As you can see, only the TREATMENT entities with a confidence score above 0.7 and the PROBLEM entities with a confidence score above 0.9 are kept in the output.
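
The filtering rule itself can be sketched in plain Python, independently of Spark. This is only an illustration of the thresholding logic using the data from the example above, not the annotator's implementation; `load_thresholds` and `filter_chunks` are hypothetical helper names, and two behaviors are assumptions: entities without a listed threshold are kept, and a confidence exactly equal to the threshold passes.

```python
import csv
import io


def load_thresholds(csv_text):
    """Parse 'ENTITY,threshold' rows into a dict of floats."""
    return {row[0].strip(): float(row[1])
            for row in csv.reader(io.StringIO(csv_text)) if row}


def filter_chunks(chunks, thresholds):
    """Keep a chunk if its confidence reaches its entity's threshold.

    Assumption: entities with no listed threshold are always kept.
    """
    return [c for c in chunks
            if c["confidence"] >= thresholds.get(c["entity"], 0.0)]


thresholds = load_thresholds("TREATMENT,0.7\nPROBLEM,0.9\n")

chunks = [
    {"text": "the genomicorganization", "entity": "TREATMENT", "confidence": 0.57785},
    {"text": "a candidate gene forType II diabetes mellitus", "entity": "PROBLEM", "confidence": 0.6614286},
    {"text": "byapproximately", "entity": "TREATMENT", "confidence": 0.7705},
    {"text": "single nucleotide polymorphisms", "entity": "TREATMENT", "confidence": 0.7204666},
    {"text": "aVal366Ala substitution", "entity": "TREATMENT", "confidence": 0.61505},
    {"text": "an 8 base-pair", "entity": "TREATMENT", "confidence": 0.29226667},
    {"text": "insertion/deletion", "entity": "PROBLEM", "confidence": 0.9841},
]

kept = filter_chunks(chunks, thresholds)
print([c["text"] for c in kept])
```

With these thresholds, the three chunks that survive are the same three shown in the filtered output above.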

    Functionality: Extended Regex Dictionary Context

    The RegexPatternsDictionary can now use a regex that spans the two previous tokens and the two next tokens. The feature is implemented with regex groups: the surrounding tokens provide the context, and the group captures the entity.

    Examples:

    • Given the sentence "The patient with ssn 123123123", we can use the regex ssn (\d{9}) to capture the entity.
    • Given the sentence "The patient has 12 years", we can use the regex (\d{2}) years to capture the entity.
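
The two patterns can be tried out with Python's standard `re` module. This is a minimal sketch of how a capture group separates the entity from its context tokens; it does not go through RegexPatternsDictionary, and `capture` is a hypothetical helper name.

```python
import re


def capture(pattern, sentence):
    """Return the text of the first capture group, or None if there is no match."""
    match = re.search(pattern, sentence)
    return match.group(1) if match else None


# The context token "ssn" must precede the nine digits; only the digits are captured.
print(capture(r"ssn (\d{9})", "The patient with ssn 123123123"))  # 123123123

# The context token "years" must follow the two digits.
print(capture(r"(\d{2}) years", "The patient has 12 years"))  # 12

# Without the context token, the same digits are not captured.
print(capture(r"ssn (\d{9})", "Call 123123123"))  # None
```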

    Ease of Use: New Pretrained De-identification Pipelines

    We developed a clinical_deidentification pretrained pipeline that can be used to de-identify PHI in medical texts. The PHI is masked and obfuscated in the resulting text. The pipeline can mask and obfuscate AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, and IPADDR entities.

    Models Hub Page: https://nlp.johnsnowlabs.com/2021/05/27/clinical_deidentification_en.html

    There is also a lightweight version of the same pipeline trained with memory-efficient glove_100d embeddings. Here are the model names:

    • clinical_deidentification
    • clinical_deidentification_glove

    Example:

    Python:

    from sparknlp.pretrained import PretrainedPipeline

    deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

    # Adjacent string literals are concatenated into a single input text
    deid_pipeline.annotate(
        "Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13. "
        "The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, "
        "Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, "
        "Patient's VIN : 1HGBH41JXMN109286."
    )

    Result:

    {'sentence': ['Record date : 2093-01-13, David Hale, M.D.',
                  'IP: 203.120.223.13.',
                  "The driver's license no:A334455B.",
                  'the SSN:324598674 and e-mail: hale@gmail.com.',
                  'Name :Hendrickson, Ora MR. #719435 Date : 01/13/93.',
                  'PCP : Oliveira, 25 years-old.',
                  "Record date : 2079-11-09, Patient's VIN :1HGBH41JXMN109286."],
     'masked': ['Record date :<DATE>, <DOCTOR>, M.D.',
                'IP: <IPADDR>.',
                "The driver's license <DLN>.",
                'the <SSN> and e-mail: <EMAIL>.',
                'Name : <PATIENT> MR. # <MEDICALRECORD> Date : <DATE>.',
                'PCP : <DOCTOR>, <AGE> years-old.',
                "Record date : <DATE>, Patient's VIN :<VIN>."],
     'obfuscated': ['Record date :2093-01-18, Dr Alveria Eden, M.D.',
                    'IP: 001.001.001.001.',
                    "The driver's license K783518004444.",
                    'the SSN-400-50-8849 and e-mail: Merilynn@hotmail.com.',
                    'Name : Charls Danger MR. # J3366417 Date : 01-18-1974.',
                    'PCP : Dr Sina Sewer, 55 years-old.',
                    "Record date : 2079-11-23, Patient's VIN :6ffff55gggg666777."],
     'ner_chunk': ['2093-01-13',
                   'David Hale',
                   'no:A334455B',
                   'SSN:324598674',
                   'Hendrickson, Ora',
                   '719435',
                   '01/13/93',
                   'Oliveira',
                   '25',
                   '2079-11-09',
                   '1HGBH41JXMN109286']}
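
To make the idea of masking concrete, the sketch below replaces detected spans with their entity labels in plain Python. It illustrates the concept only and is not how the pretrained pipeline is implemented; `mask` and `span` are hypothetical helpers, and the `<LABEL>` placeholder format is an assumption made for this illustration.

```python
def mask(text, chunks):
    """Replace each (begin, end_inclusive, label) span with <label>."""
    pieces = []
    cursor = 0
    for begin, end, label in sorted(chunks):
        pieces.append(text[cursor:begin])   # untouched text before the chunk
        pieces.append(f"<{label}>")         # placeholder instead of the chunk
        cursor = end + 1
    pieces.append(text[cursor:])            # untouched text after the last chunk
    return "".join(pieces)


def span(text, chunk, label):
    """Locate a chunk in the text and return its inclusive offsets with a label."""
    begin = text.index(chunk)
    return (begin, begin + len(chunk) - 1, label)


sentence = "PCP : Oliveira, 25 years-old."
chunks = [span(sentence, "Oliveira", "DOCTOR"), span(sentence, "25", "AGE")]
masked = mask(sentence, chunks)
print(masked)  # PCP : <DOCTOR>, <AGE> years-old.
```

Obfuscation follows the same span-replacement pattern, except that each span is replaced with a realistic surrogate value instead of a label.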


    Our additional expert:
    Veysel is a Head of Data Science at John Snow Labs, improving the Spark NLP for the Healthcare library and delivering hands-on projects in Healthcare and Life Science. Holding a PhD degree in ML, Dr. Kocaman has authored more than 25 papers in peer reviewed journals and conferences in the last few years, focusing on solving real world problems in healthcare with NLP. He is a seasoned data scientist with a strong background in every aspect of data science including machine learning, artificial intelligence, and big data with over ten years of experience. Veysel has broad consulting experience in Statistics, Data Science, Software Architecture, DevOps, Machine Learning, and AI to several start-ups, boot camps, and companies around the globe. He also speaks at Data Science & AI events, conferences and workshops, and has delivered more than a hundred talks at international as well as national conferences and meetups.
