Understanding the Use Case
The PHI detection solution we delivered to the customer is designed to identify sensitive information in clinical notes, ensuring privacy and compliance with healthcare standards. However, datasets are not static. Over time, the document types, patient populations, or even the medical contexts the customer works with may change. This is what we call dataset shift: the characteristics of incoming data deviate from those of the training and validation datasets.
For example, today’s clinical notes might predominantly involve certain formats, like discharge summaries or lab reports, while tomorrow they could include telehealth transcripts or scanned handwritten notes. These shifts can affect the model’s performance because it has been trained on a specific distribution of data.
To address this, we will share techniques for detecting and handling dataset shift, so you can monitor and maintain the performance of the pipeline over time. Let us walk you through the dimensions of distribution monitoring and how they keep the model reliable on future datasets.
Monitoring Distribution Shifts
1. Monitoring Dataset Distribution
One of the most significant areas to monitor is the overall dataset distribution, which might evolve due to several factors:
- Source of Clinical Notes: The origin of the notes may change, such as data coming from a different health system, provider, or insurance company.
- Specialty Focus: The dataset might shift from internal medicine to specialties like oncology, cardiology, or pediatrics, each of which has its unique terminology and patterns.
- Document Types: The composition of document types may change, including:
  - Free-text notes from physicians.
  - Structured official reports like radiology findings.
  - Discharge summaries or lab results.
  - Scanned handwritten notes.
  - EHR data exported in standardized formats.
Changes in these aspects can alter the underlying statistical properties of the dataset. For example:
- A dataset dominated by oncology notes might feature different vocabulary and entities compared to general internal medicine.
- Free-text notes might be less structured and more variable than standardized EHR data, posing unique challenges.
How to Monitor:
- Track the frequency of different document sources, specialties, and types over time.
- Compare the token frequencies, document lengths, and structure with the baseline training dataset.
- Use visualization tools to highlight shifts in these characteristics.
Indicators of Shift:
- A significant increase in documents from new specialties or sources.
- New patterns in document length or structure (e.g., shorter notes or unstructured text replacing well-structured formats).
Proactive monitoring ensures that such shifts are detected early, allowing for adjustments to the pipeline or further investigation. A lightweight check along these lines is sketched below.
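The following is a minimal sketch, assuming you can extract a document length and a type label per note; the statistical tests and the alpha thresholds are illustrative choices, not part of the delivered pipeline.

```python
# Minimal sketch: compare a new batch of notes against the baseline training
# set on two simple signals, document length and document-type mix.
from collections import Counter

import numpy as np
from scipy import stats

def check_length_shift(baseline_lengths, new_lengths, alpha=0.01):
    """Two-sample KS test: has the document-length distribution moved?"""
    statistic, p_value = stats.ks_2samp(baseline_lengths, new_lengths)
    return {"ks_statistic": statistic, "p_value": p_value, "shifted": p_value < alpha}

def check_type_shift(baseline_types, new_types, alpha=0.01):
    """Chi-square test on document-type frequencies (discharge summary, lab report, ...)."""
    categories = sorted(set(baseline_types) | set(new_types))
    base_counts = Counter(baseline_types)
    new_counts = Counter(new_types)
    observed = np.array([new_counts.get(c, 0) for c in categories], dtype=float)
    # Baseline distribution with add-one smoothing (avoids zero expected counts
    # for document types never seen in training), scaled to the new batch size.
    expected = np.array([base_counts.get(c, 0) + 1 for c in categories], dtype=float)
    expected = expected / expected.sum() * observed.sum()
    statistic, p_value = stats.chisquare(observed, expected)
    return {"chi2": statistic, "p_value": p_value, "shifted": p_value < alpha}
```

A low p-value on either check is a trigger for investigation, not a verdict that PHI detection has degraded.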
2. Monitoring Entity Distribution
Entity distribution is another critical aspect. The PHI detection model relies on recognizing named entities like patient names, dates, and identifiers. If the proportions of these entities change, it could affect the model’s predictions. For example:
- If names currently make up 30% of detected entities and that share drops to 10% in a future dataset, the characteristics of the documents have likely changed.
- Changes in the co-occurrence of entities (e.g., names often appearing with dates in one dataset but not in another) can also impact predictions.
To monitor this, we suggest calculating and tracking the ratio of each entity type in the dataset over time. If the ratio of any entity type deviates significantly from the baseline, it could signal a dataset shift.
Even a subtle shift in entity distribution can alter how sequences of tokens appear, which in turn can affect the co-occurrence relationships the model learned during training, so it needs to be monitored closely. One way to implement this tracking is sketched below.
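Here is a minimal sketch, assuming the pipeline emits one entity-type label per detected entity; the baseline ratios would come from the training or validation set, and the 25% relative-deviation tolerance is only an illustrative starting point.

```python
# Minimal sketch of entity-ratio tracking. Entity-type labels (NAME, DATE, ID, ...)
# and the rel_tolerance value are illustrative assumptions.
from collections import Counter

def entity_ratios(entity_labels):
    """Return the share of each entity type in a batch of detections."""
    counts = Counter(entity_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {etype: n / total for etype, n in counts.items()}

def flag_ratio_shifts(baseline_ratios, current_ratios, rel_tolerance=0.25):
    """Flag entity types whose share moved more than rel_tolerance from baseline."""
    flagged = {}
    for etype, base in baseline_ratios.items():
        current = current_ratios.get(etype, 0.0)
        if abs(current - base) / base > rel_tolerance:
            flagged[etype] = {"baseline": base, "current": current}
    # Entity types that only appear in the current batch deserve a look too.
    for etype in set(current_ratios) - set(baseline_ratios):
        flagged[etype] = {"baseline": 0.0, "current": current_ratios[etype]}
    return flagged
```

Under this rule, the example above (the NAME share falling from 30% to 10%) is a 67% relative change and would be flagged.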
3. Monitoring Confidence Distribution
The third dimension is how confident the model is in its predictions for each entity type across different document types.
For instance, if the model is very confident in detecting names in today’s dataset but less so in a future dataset, that’s a red flag. It might mean the model is encountering patterns it hasn’t seen before.
We suggest creating sub-cohorts within the current dataset (e.g., based on document type or source) and calculating confidence distributions for each sub-cohort. This becomes a benchmark. In the future, you can run the model on representative samples from new datasets and check whether confidence levels are consistent. If not, the model might be struggling with unseen patterns.
This approach doesn’t just detect shifts; it helps you understand where the shift is happening, whether it’s a specific type of document or a particular entity type that’s causing the issue. The sketch below outlines this benchmark-and-compare loop.
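A minimal sketch follows, assuming each prediction carries a cohort label (e.g., document type), an entity type, and a confidence score; the field names, the KS test, and the thresholds are illustrative assumptions.

```python
# Minimal sketch of per-cohort confidence benchmarking. Each prediction is
# assumed to be a dict with "cohort", "entity_type", and "confidence" keys;
# these field names are illustrative.
from collections import defaultdict

from scipy import stats

def group_confidences(predictions):
    """Group confidence scores by (cohort, entity_type), e.g. ("discharge_summary", "NAME")."""
    groups = defaultdict(list)
    for pred in predictions:
        groups[(pred["cohort"], pred["entity_type"])].append(pred["confidence"])
    return groups

def compare_to_benchmark(benchmark_preds, new_preds, alpha=0.01, min_samples=30):
    """KS-test each sub-cohort's confidence distribution against its benchmark."""
    benchmark = group_confidences(benchmark_preds)
    current = group_confidences(new_preds)
    report = {}
    for key, base_scores in benchmark.items():
        new_scores = current.get(key, [])
        if len(new_scores) < min_samples:
            continue  # Too few samples in this sub-cohort for a meaningful test.
        statistic, p_value = stats.ks_2samp(base_scores, new_scores)
        report[key] = {"ks": statistic, "p_value": p_value, "shifted": p_value < alpha}
    return report
```

A flagged (cohort, entity type) pair tells you exactly which slice of the data to sample for annotation.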
What Happens After Detecting a Shift?
These monitoring techniques are not about saying, “The model is broken.” Instead, they’re about generating signals to trigger further investigation:
- Sample Annotation: Pick representative samples from the shifted dataset, annotate them, and check how well the model performs on this subset.
- Retraining: If the shift is significant, incorporate the new samples into the training dataset and retrain the model to adapt to the new patterns.
- Validation: Validate the updated model against the new dataset to ensure it performs reliably.
By automating these monitoring processes, you can minimize manual effort and quickly spot issues as new datasets come in. The glue sketch below shows how the individual checks can be combined into a single batch job.
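As an illustration only, the earlier sketches could be combined like this; the `baseline` and `batch` dictionaries are hypothetical stand-ins for however you store reference statistics and load incoming data.

```python
# Illustrative glue reusing the helper functions sketched above. The structure
# of the `baseline` and `batch` dictionaries is a hypothetical convention.
def run_shift_checks(baseline, batch):
    """Aggregate all three monitoring signals for one incoming batch."""
    alerts = []
    if check_length_shift(baseline["lengths"], batch["lengths"])["shifted"]:
        alerts.append("document-length distribution shifted")
    if check_type_shift(baseline["types"], batch["types"])["shifted"]:
        alerts.append("document-type mix shifted")
    flagged = flag_ratio_shifts(baseline["entity_ratios"], entity_ratios(batch["entities"]))
    if flagged:
        alerts.append(f"entity ratios shifted for: {sorted(flagged)}")
    drifted = compare_to_benchmark(baseline["predictions"], batch["predictions"])
    drifted = {key: r for key, r in drifted.items() if r["shifted"]}
    if drifted:
        alerts.append(f"confidence drift in sub-cohorts: {sorted(drifted)}")
    # Any alert triggers sample annotation and review, not automatic retraining.
    return alerts
```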
Final Thoughts
The key takeaway is that dataset shift is inevitable. New types of documents, evolving healthcare practices, and changing patient demographics will all contribute to it. But with the right monitoring strategy—keeping an eye on dataset distribution, entity distribution, and confidence distribution—you can stay ahead of these changes.
Even if shifts occur, we have the tools to detect, analyze, and address them to keep the PHI detection pipeline running smoothly.