In an era of rapidly advancing healthcare technology, the protection of patient privacy is more critical than ever. Medical records, rich with sensitive information, are invaluable for research and innovation but must be carefully managed to ensure compliance with regulations like HIPAA and GDPR. De-identification, the process of removing or obscuring personally identifiable information (PII) from medical data, lies at the heart of this effort.
Leveraging LLMs to de-identify protected health information (PHI) might be considered excessive and potentially unreliable, depending on the use case and the required level of customization. While LLMs like ChatGPT are highly capable of generating high-quality text, they are not specifically designed or optimized for the precise task of PHI de-identification.
This blog compares the de-identification services provided by Healthcare NLP, Amazon, and Azure, focusing on their accuracy when applied to a dataset annotated by healthcare experts.
This comparison is valuable because it provides critical insights into the strengths and limitations of each de-identification solution, enabling researchers, developers, and organizations to make informed decisions when choosing a tool. For researchers, it helps identify the most accurate and reliable service for sensitive data processing. For developers, it highlights the ease of integration and API flexibility, which are crucial for building scalable solutions. For organizations, particularly in healthcare and finance, it offers a clear perspective on compliance capabilities and performance, ensuring the selected tool meets regulatory requirements while optimizing operational efficiency.
Dataset
For this benchmark, we utilized 48 open-source documents annotated by domain experts from John Snow Labs. The annotations focused on the extraction of IDNUM, LOCATION, DATE, AGE, NAME, and CONTACT entities because these labels represent critical personal information that is commonly targeted for de-identification in various healthcare settings. These labels are typically associated with sensitive aspects of a person’s identity and are often required to be removed or anonymized in compliance with regulations like HIPAA or GDPR. By centering the benchmark around these entities, the task ensures that the performance evaluation is directly relevant to the challenges of real-world de-identification, which often involves identifying and obscuring such critical personal data.
Tools Compared
Healthcare NLP & LLM
The Healthcare Library is a powerful component of John Snow Labs’ Spark NLP platform, designed to facilitate NLP tasks within the healthcare domain. This library provides over 2,500 pre-trained models and pipelines tailored for medical data, enabling accurate information extraction, NER for clinical and medical concepts, and text analysis capabilities. Regularly updated and built with cutting-edge algorithms, the Healthcare library aims to streamline information processing and empower healthcare professionals with deeper insights from unstructured medical data sources, such as electronic health records, clinical notes, and biomedical literature.
John Snow Labs has created custom large language models (LLMs) tailored for diverse healthcare use cases. These models come in different sizes and quantization levels, designed to handle tasks such as summarizing medical notes, answering questions, performing retrieval-augmented generation (RAG), named entity recognition, and facilitating healthcare-related chats.
John Snow Labs’ Healthcare NLP & LLM library offers a powerful solution to streamline the de-identification of medical records. By leveraging advanced Named Entity Recognition (NER) models, the library can automatically identify and de-identify Protected Health Information (PHI) in unstructured clinical notes. This capability ensures compliance with privacy regulations while maintaining the utility of the data for research and analysis. With this technology, healthcare organizations can efficiently anonymize sensitive information, enabling the safe sharing of clinical data for studies, improving patient privacy, and fostering innovation in medical research.
With the Healthcare NLP library, you can either build a custom de-identification pipeline to target specific labels or use pretrained pipelines with just two lines of code to de-identify a wide range of entities, including AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR, and EMAIL.
For this benchmark, we specifically used the clinical_deidentification_docwise_benchmark pretrained pipeline, which is designed to extract and de-identify NAME, IDNUM, CONTACT, LOCATION, AGE, and DATE entities. It’s important to note that this pipeline does not rely on any LLM components.
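Here is sample code for using the Healthcare NLP pipeline:
from sparknlp.pretrained import PretrainedPipeline

# Assumes a Spark session with the licensed Healthcare NLP library is already running.
# Load the pretrained de-identification pipeline used in this benchmark.
deid_pipeline = PretrainedPipeline("clinical_deidentification_docwise_benchmark", "en", "clinical/models")

sample_text = "Patient John Doe was admitted to Boston General Hospital on 01/12/2023."

# fullAnnotate returns the annotations of each pipeline stage, including character offsets.
result = deid_pipeline.fullAnnotate(sample_text)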
Azure Health Data Services
Azure Health Data Services’ de-identification service is designed to protect sensitive health information while preserving data utility. This API leverages natural language processing techniques to identify, label, redact, or surrogate Protected Health Information (PHI) in unstructured medical texts. Launched in 2024, the service offers three key operations: Tag, Redact, and Surrogate, enabling healthcare organizations to process diverse types of clinical documents securely and efficiently.
The de-identification service stands out for its ability to balance privacy concerns with data usability. By employing machine learning algorithms, it can detect HIPAA’s 18 identifiers and other PHI entities, ensuring compliance with various regional privacy regulations such as GDPR and CCPA. This innovative tool empowers healthcare professionals, researchers, and organizations to unlock the potential of their clinical data for machine learning, analytics, and collaborative research while maintaining the highest standards of patient privacy and data protection. You can get more details from here: https://learn.microsoft.com/en-us/azure/healthcare-apis/deidentification/
Here is sample code for using Azure Health Data Services:
import os

from azure.health.deidentification import DeidentificationClient
from azure.health.deidentification.models import DeidentificationContent, DeidentificationResult
from azure.identity import DefaultAzureCredential

# Credentials that DefaultAzureCredential reads from the environment
CREDENTIALS_IN_JSON_FORMAT = {
    "AZURE_CLIENT_ID": "",      # update here with yours
    "AZURE_TENANT_ID": "",      # update here with yours
    "AZURE_CLIENT_SECRET": "",  # update here with yours
}
for k, v in CREDENTIALS_IN_JSON_FORMAT.items():
    os.environ[k] = v

# This is constant for one app setup
os.environ["AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT"] = ""  # update here

# The client is given the host name without the "https://" scheme
endpoint = os.environ["AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT"]
endpoint = endpoint.replace("https://", "")

credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)
client = DeidentificationClient(endpoint, credential)

sample_text = "Patient John Doe was admitted to Boston General Hospital on 01/12/2023."

# The "tag" operation labels PHI entities without altering the text
body = DeidentificationContent(input_text=sample_text, operation="tag")
result: DeidentificationResult = client.deidentify(body)
Amazon Comprehend Medical
Amazon Comprehend Medical is a HIPAA-eligible natural language processing (NLP) service that leverages machine learning to extract valuable health data from unstructured medical text. This powerful tool can quickly and accurately identify medical entities such as conditions, medications, dosages, tests, treatments, and protected health information (PHI) from various clinical documents including physician’s notes, discharge summaries, and test results. With its ability to understand context and relationships between extracted information, Amazon Comprehend Medical offers a robust solution for healthcare professionals and researchers looking to automate data extraction, improve patient care, and streamline clinical workflows. You can get more details from here: https://docs.aws.amazon.com/comprehend-medical/latest/dev/textanalysis-phi.html
Here is sample code for using Amazon Comprehend Medical:
import boto3

# MFA_validated_token is not defined in this snippet. It is assumed to hold the
# response of an earlier, MFA-authenticated STS call that returned temporary credentials.
# Extract validated credentials for role assumption
validated_access_key_id = MFA_validated_token['Credentials']['AccessKeyId']
validated_secret_access_key = MFA_validated_token['Credentials']['SecretAccessKey']
validated_session_token = MFA_validated_token['Credentials']['SessionToken']

temp_sts_client = boto3.client(
    'sts',
    aws_access_key_id=validated_access_key_id,
    aws_secret_access_key=validated_secret_access_key,
    aws_session_token=validated_session_token,
)

# choose your role
target_role_arn = ""  # update here

# Assume the desired role
response = temp_sts_client.assume_role(
    RoleArn=target_role_arn,
    RoleSessionName='MedComp',
)

# temporary creds
tmp_access_key_id = response['Credentials']['AccessKeyId']
tmp_secret_access_key = response['Credentials']['SecretAccessKey']
tmp_session_token = response['Credentials']['SessionToken']

client = boto3.client(
    service_name='comprehendmedical',
    region_name='',  # update here
    aws_access_key_id=tmp_access_key_id,
    aws_secret_access_key=tmp_secret_access_key,
    aws_session_token=tmp_session_token,
)

sample_text = "Patient John Doe was admitted to Boston General Hospital on 01/12/2023."

# Detect PHI entities in the text
result = client.detect_phi(Text=sample_text)
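The detect_phi response carries an Entities list in which each item has character offsets (BeginOffset, EndOffset), a Type, and a Score. As a minimal sketch, assuming that response shape, the helper below turns a response into plain (start, end, label, text) tuples that are easier to compare across tools. The helper name phi_spans, the min_score parameter, and the mocked response are illustrative assumptions and are not taken from the benchmark notebook.

```python
def phi_spans(response, min_score=0.0):
    """Convert a detect_phi-style response dict into sorted (start, end, label, text) tuples."""
    spans = []
    for entity in response.get("Entities", []):
        # Skip low-confidence detections (min_score is a hypothetical knob)
        if entity.get("Score", 1.0) < min_score:
            continue
        spans.append(
            (entity["BeginOffset"], entity["EndOffset"], entity["Type"], entity["Text"])
        )
    return sorted(spans)


sample_text = "Patient John Doe was admitted to Boston General Hospital on 01/12/2023."

# Mocked response for illustration only; a real call would return the service's own output
mock_response = {
    "Entities": [
        {"BeginOffset": 60, "EndOffset": 70, "Type": "DATE", "Text": "01/12/2023", "Score": 0.98},
        {"BeginOffset": 8, "EndOffset": 16, "Type": "NAME", "Text": "John Doe", "Score": 0.99},
    ]
}

for start, end, label, text in phi_spans(mock_response):
    print(start, end, label, text)
```

Keeping only offsets and labels makes the later comparison independent of any one vendor's response format.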
Comparison of the Tools
The most significant difference between these tools lies in their adaptability. Azure Health Data Services and Amazon Comprehend Medical are API-based, black-box cloud solutions, so their results cannot be modified or adapted to specific needs. On the other hand, the Healthcare NLP library’s de-identification pipeline can be loaded and utilized with only two lines of code. You can also customize the outputs by adjusting the pipeline’s stages to meet your specific needs. Furthermore, it can be used locally.
You can find the code for using all these tools in the De-identification Performance Comparison Of Healthcare NLP VS Cloud Solutions Notebook.
Evaluation Criteria
In this benchmark study, we employed two distinct approaches to compare accuracy:
A. Entity-Level Evaluation: Since de-identifying PHI data is a critical task, we evaluated how well de-identification tools detected entities present in the annotated dataset, regardless of their specific labels in the ground truth. The detection outcomes were categorized as:
- full_match: The entire entity was correctly detected.
- partial_match: Only a portion of the entity was detected.
- not_matched: The entity was not detected at all.
For example:
Text: “Patient John Doe was admitted to Boston General Hospital on 01/12/2023.”
Ground Truth Entity: John Doe (NAME)
Predicted Entity: John Doe (NAME) ==> full_match
Predicted Entity: John ==> partial_match
If predictions don’t have any match with “John Doe”, the result is “not_matched”.
B. Token-Level Accuracy: The text in the annotated dataset was tokenized, and the ground truth labels assigned to each token were compared with predictions made by the Healthcare NLP library, Amazon Comprehend Medical, and Azure Health Data Services. Classification reports were generated for each tool, comparing their precision, recall, and F1 scores.
This dual approach comprehensively evaluated each tool’s performance in de-identifying PHI data.
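The two evaluation approaches described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: spans are (start, end) character offsets with an exclusive end, an entity counts as a full match only when every one of its characters is covered by some prediction, and tokens outside any entity carry the label "O". The function names match_entity and token_report are hypothetical.

```python
from collections import Counter


def match_entity(gt_span, predicted_spans):
    """Entity level: classify one ground-truth span, ignoring labels."""
    gt_start, gt_end = gt_span
    covered = [False] * (gt_end - gt_start)
    for p_start, p_end in predicted_spans:
        # Mark the characters of the ground-truth span that this prediction overlaps
        for i in range(max(gt_start, p_start), min(gt_end, p_end)):
            covered[i - gt_start] = True
    if covered and all(covered):
        return "full_match"
    if any(covered):
        return "partial_match"
    return "not_matched"


def token_report(gold, predicted, outside_label="O"):
    """Token level: per-label precision, recall and F1 for two aligned label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            if g != outside_label:
                tp[g] += 1
            continue
        if p != outside_label:
            fp[p] += 1
        if g != outside_label:
            fn[g] += 1
    report = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


text = "Patient John Doe was admitted to Boston General Hospital on 01/12/2023."
start = text.index("John Doe")
name_span = (start, start + len("John Doe"))

print(match_entity(name_span, [(8, 16)]))   # prediction covers "John Doe"
print(match_entity(name_span, [(8, 12)]))   # prediction covers only "John"
print(match_entity(name_span, [(60, 70)]))  # only the date was predicted

# One label per whitespace token of the sample sentence (illustrative labels)
gold = ["O", "NAME", "NAME", "O", "O", "O", "LOCATION", "LOCATION", "LOCATION", "O", "DATE", "O"]
pred = ["O", "NAME", "O", "O", "O", "O", "LOCATION", "LOCATION", "LOCATION", "O", "DATE", "O"]
print(token_report(gold, pred)["NAME"])
```

With these inputs the three match_entity calls return full_match, partial_match, and not_matched, and NAME gets a recall of 0.5 because the token "Doe" was missed.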
Methodology
In this study, some differences were observed between the predictions of the de-identification services and the ground truth labels. In the ground truth dataset, entities were annotated with more generic labels. For example, all names were annotated as NAME instead of using labels like PATIENT_NAME for patient names and DOCTOR_NAME for doctor names. To ensure consistency, the labels from the de-identification tools’ predictions were mapped to the corresponding ground truth labels.
Additionally, for a fair comparison, entities that could not be mapped to the ground truth labels (e.g., PROFESSION, ORGANIZATION, etc.) were excluded from the predictions before comparing the results.
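The mapping and filtering step can be pictured as a simple dictionary lookup. The sketch below is only an illustration: the LABEL_MAP entries are an assumed example built from labels mentioned in this post, not the exact mapping used in the notebook, and normalize_predictions is a hypothetical helper name.

```python
# Hypothetical mapping from tool-specific labels to the ground-truth label set
LABEL_MAP = {
    "PATIENT": "NAME",
    "DOCTOR": "NAME",
    "HOSPITAL": "LOCATION",
    "CITY": "LOCATION",
    "PHONE": "CONTACT",
    "EMAIL": "CONTACT",
    "MEDICALRECORD": "IDNUM",
    "DATE": "DATE",
    "AGE": "AGE",
}


def normalize_predictions(predictions, label_map=LABEL_MAP):
    """Map (start, end, label) predictions to ground-truth labels; drop unmappable ones."""
    normalized = []
    for start, end, label in predictions:
        mapped = label_map.get(label)
        if mapped is not None:  # e.g. PROFESSION has no ground-truth counterpart
            normalized.append((start, end, mapped))
    return normalized


# Made-up predictions for illustration
predictions = [
    (8, 16, "PATIENT"),
    (33, 56, "HOSPITAL"),
    (60, 70, "DATE"),
    (0, 7, "PROFESSION"),
]
print(normalize_predictions(predictions))
```

Dropping unmappable labels before scoring keeps a tool from being penalized for entities the annotators never labeled.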
After obtaining the predictions and completing the preprocessing, the entity count table is as follows:
Results and Findings
Entity-Level Evaluation
The results obtained from matching the predictions made by Healthcare NLP, Amazon Comprehend Medical, and Azure Health Data Services with the ground truth entities are as follows:
Based on these results, the percentages representing the match rates for each de-identification tool are plotted as shown below:
Token-Level Evaluation
The resulting data frame, obtained by tokenizing the text annotated as ground truth and adding the corresponding ground truth, Healthcare NLP, Amazon Comprehend Medical, and Azure Health Data Services prediction labels for each token, is as follows:
Based on the token-level results, the classification reports showing the comparison between each de-identification tool’s predictions and the ground truth labels are as follows:
Healthcare NLP:
Azure Health Data Services:
Amazon Comprehend Medical:
Let’s visualize the F1-Scores for each label on a single plot:
PHI Entity Prediction Comparison
The de-identification task is crucial in ensuring the privacy of sensitive data. In this context, the primary focus is not on the specific labels of PHI entities but on whether they are successfully detected. When evaluating the results based on the classification of entities as PHI or non-PHI, the outcomes for PHI entity detection are as follows:
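One way to express this PHI versus non-PHI view is to collapse every entity label into a single "is PHI" flag before scoring. The following is a minimal sketch under the assumption that non-PHI tokens carry the label "O"; the function name phi_detection_scores and the toy label lists are illustrative.

```python
def phi_detection_scores(gold, predicted, non_phi_label="O"):
    """Precision, recall and F1 for PHI vs non-PHI, ignoring the specific PHI label."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        gold_is_phi = g != non_phi_label
        pred_is_phi = p != non_phi_label
        if gold_is_phi and pred_is_phi:
            tp += 1  # detected as PHI, even if the label differs
        elif pred_is_phi:
            fp += 1  # flagged as PHI but not annotated as such
        elif gold_is_phi:
            fn += 1  # PHI that was missed entirely
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: the third token is PHI with the wrong label, the fifth is missed
gold = ["O", "NAME", "NAME", "O", "LOCATION", "DATE"]
pred = ["O", "NAME", "LOCATION", "O", "O", "DATE"]
scores = phi_detection_scores(gold, pred)
print(scores)
```

In this toy example the mislabeled token still counts as detected, so only the missed LOCATION token lowers recall, to 0.75.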
Price Analysis of the Tools
When handling large volumes of clinical notes, cost becomes a key consideration. Since datasets as small as our 48-document benchmark are rare in real-world applications, we estimated the expenses for these tools based on processing 1 million unstructured clinical notes, with each document averaging around 5,250 characters.
Amazon Comprehend Medical Pricing: According to the price calculator, obtaining PHI predictions for 1M documents, with an average of 5,250 characters per document, costs $14,525.
Azure Health Data Services Pricing: Based on the pricing page, generating PHI predictions for 1M documents, each averaging 5,250 characters, costs $13,125.
Healthcare NLP Pricing: When using the John Snow Labs Healthcare NLP Prepaid product on a 32-CPU EC2 machine (c6a.8xlarge at $1.2 per hour), obtaining de-identified output for the PHI entities in the 48 benchmark documents takes around 39.4 seconds. At that rate, processing 1M documents and de-identifying the PHI entities would take about 13,680 minutes (228 hours, or approximately 9.5 days) on a single machine; with proper scaling, it can be completed in just one day. This costs $273 for infrastructure and $2,145 for the license (prorated from a 1-month license price of $7,000). Thus, the total cost for Healthcare NLP is approximately $2,418.
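The Healthcare NLP estimate can be reproduced with a few lines of arithmetic. The inputs below come from the paragraph above; the one assumption is that the license is prorated over a 31-day month, which is inferred from the quoted license figure rather than stated in the post.

```python
DOCS_BENCHMARKED = 48
SECONDS_FOR_BENCHMARK = 39.4
TARGET_DOCS = 1_000_000
EC2_PRICE_PER_HOUR = 1.2          # c6a.8xlarge, USD
LICENSE_PRICE_PER_MONTH = 7_000   # USD
DAYS_PER_MONTH = 31               # assumption inferred from the quoted license cost

# Scale the measured throughput linearly to the target volume
total_seconds = TARGET_DOCS / DOCS_BENCHMARKED * SECONDS_FOR_BENCHMARK
total_minutes = total_seconds / 60
total_hours = total_seconds / 3600
total_days = total_hours / 24

infrastructure_cost = total_hours * EC2_PRICE_PER_HOUR
license_cost = total_days / DAYS_PER_MONTH * LICENSE_PRICE_PER_MONTH
total_cost = infrastructure_cost + license_cost

print(f"{total_minutes:.0f} minutes, {total_hours:.0f} hours, {total_days:.1f} days")
print(f"infrastructure: ${infrastructure_cost:.2f}")
print(f"license: ${license_cost:.2f}")
print(f"total: ${total_cost:.2f}")
```

The result lands at roughly 228 hours (9.5 days), about $273.6 for infrastructure, and about $2,145 for the license, in line with the figures quoted above. Spreading the work over several machines shortens the wall-clock time but leaves the total machine-hours, and therefore the infrastructure cost, about the same.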
According to our estimates for processing 1 million clinical notes, Azure and Amazon incur substantially higher costs, while Healthcare NLP allows organizations to perform de-identification at a fraction of the price. This makes it an ideal choice for institutions handling large volumes of data, providing both affordability and customization without compromising performance.
Conclusion
In this study, we compared the performance of Healthcare NLP, Amazon Comprehend Medical, and Azure Health Data Services tools on a ground truth dataset annotated by medical experts. The results were evaluated from two perspectives: entity-level and token-level.
- The entity-level evaluation showed that Healthcare NLP outperformed the others in terms of accurately capturing entities and minimizing missed entities. Following Healthcare NLP, Azure Health Data Services and Amazon Comprehend Medical ranked second and third, respectively.
- The token-level evaluation revealed that Healthcare NLP achieved the best results in terms of precision, recall, and F1 score, with Azure Health Data Services and Amazon Comprehend Medical trailing behind in that order.
- Adaptability of Tools: One of the key differences among these tools is their level of adaptability. While Azure Health Data Services and Amazon Comprehend Medical are API-based, black-box cloud solutions that do not allow modifications or customization of results, the Healthcare NLP library offers a highly flexible approach. Its de-identification pipeline can be implemented with just two lines of code, and users can easily customize outputs by modifying the pipeline’s stages to suit specific requirements.
- Cost-Effectiveness: Healthcare NLP emerges as the most cost-effective solution for processing large-scale clinical notes. API-based services such as Azure and Amazon charge per request, so their cost grows with volume and can become expensive at scale. Healthcare NLP instead offers a flexible, local deployment option with consistent pricing: even if you process 1 billion clinical notes, the license cost remains fixed for the same time period.
In summary, the Healthcare NLP library consistently outperformed Azure Health Data Services and Amazon Comprehend Medical in both entity-level and token-level evaluations, achieving the highest precision, recall, and F1 scores while minimizing missed entities. Beyond accuracy, its adaptability sets it apart: unlike Azure and Amazon, which are black-box API solutions with no customization options, Healthcare NLP allows users to modify its de-identification pipeline to meet specific needs. Additionally, it proves to be the most cost-effective solution, offering substantial savings compared to cloud-based alternatives that charge per request.
You can find the whole code in this notebook: De-identification Performance Comparison Of Healthcare NLP VS Cloud Solutions Notebook
Beyond its standalone capabilities, the Healthcare NLP library is also available on AWS, Snowflake, and Databricks Marketplaces, making it easier for organizations to integrate de-identification solutions into their existing workflows.
- Seamless Integration: Deploy directly within your preferred cloud ecosystem without a complex setup.
- Scalability: Process large-scale clinical data efficiently.
- Compliance & Security: Designed for healthcare-grade data protection, meeting HIPAA and GDPR standards.
Healthcare NLP models are licensed, so if you want to use these models, you can watch the “Get a Free License For John Snow Labs NLP Libraries” video and request a license here.