AI-Powered Oncology: Healthcare NLP’s Role in Cancer Research and Treatment

30.01.2025

Gursev Pirge

Researcher and Data Scientist

This blog post explores how John Snow Labs’ Healthcare NLP & LLM library revolutionizes oncology case analysis by extracting actionable insights from clinical text. Key use cases include detecting valuable information using NER, assertion status, relation extraction, and ICD-10 mapping models; summarizing reports and enabling Q&A with LLMs; and leveraging zero-shot NER for identifying new entities with minimal effort. These approaches streamline oncology data analysis, enhance decision-making, and improve patient outcomes.

Despite encouraging declines in several cancer types, the Annual Report to the Nation (2014–2018) reveals concerning upward trends in new cancer cases across multiple sites. Most notably, men face a 3.0% annual increase in prostate cancer cases, while women show a 1.8% rise in melanoma of the skin. Liver and intrahepatic bile duct cancer rates are climbing in both sexes: 0.4% for men and 1.6% for women. Particularly troubling is the consistent rise in pancreatic cancer, with increases of 1.1% and 1.0% for men and women respectively. These statistics underscore the ongoing challenges in cancer prevention and the urgent need for early detection techniques, especially for these rising cancer types.

This growing prevalence underscores the need for advanced tools to analyze and interpret the vast amounts of clinical data generated in oncology. With breakthroughs in Natural Language Processing (NLP) and Large Language Models (LLMs), particularly through the Healthcare NLP library, healthcare professionals can more effectively extract critical insights from unstructured medical data, including clinical notes, pathology reports, and medical literature, enhancing our understanding of cancer cases and improving patient care.

These AI tools can identify subtle patterns and risk factors that might be overlooked in traditional screening methods, potentially enabling earlier interventions and improving patient outcomes.

https://seer.cancer.gov/report_to_nation/statistics.html

This blog post details a comprehensive approach for leveraging medical NLP models to analyze oncology cases, using a combination of Healthcare NLP’s domain-specific capabilities and the versatility of LLMs. Here, we tackle the complexities of clinical data processing through some key use cases:

Metastasis Detection through Entity Extraction and Relationship Mapping:
The process begins with extracting relevant oncological information using Named Entity Recognition (NER). We identify crucial entities such as cancer types, metastasis sites, and patient demographics. Assertion status detection then determines whether specific conditions, like metastasis, are present or negated in the text. To further enrich the analysis, relation extraction techniques link oncological entities, providing a structured representation of interconnected findings. Finally, using the ICD-10 code resolver, the extracted data is mapped to standard medical codes, ensuring alignment with global diagnostic frameworks.
Biomarker Analysis and Relationship Extraction:
Biomarkers play a pivotal role in modern oncology, serving as indicators for diagnosis, prognosis, and treatment response. Our second use case focuses on first identifying the documents which involve mentions of biomarkers by using a text classifier and then extracting biomarker-related information from clinical reports, identifying key markers and their associated results, such as numeric values or categorical outcomes. Relation extraction is used to connect biomarkers to their respective results, enabling a detailed understanding of the role biomarkers play in cancer diagnosis.
Summarization and Question-Answering with LLMs:
Large language models have proven their utility in making clinical text more accessible. In this use case, we explore how LLMs can summarize lengthy oncology reports into concise narratives, making critical insights easily digestible for clinicians. Additionally, question-answering functionality allows users to query reports directly, retrieving specific information quickly, such as treatment history or tumor staging, without manually combing through the text.
Zero-Shot Named Entity Recognition for Flexible Analysis:
The final use case highlights the power of zero-shot NER, a cutting-edge approach that enables entity recognition with minimal training or annotation. This method proves especially valuable in scenarios where new entity types or rare conditions need to be identified rapidly, offering unparalleled adaptability to diverse and evolving oncology datasets.

Together, these use cases illustrate the transformative potential of combining Healthcare NLP and LLMs for oncology case analysis. Whether it’s detecting metastasis, interpreting biomarker results, or leveraging zero-shot capabilities for rapid adaptation, this blog post demonstrates how Medical Language Models can enhance decision-making and improve patient outcomes in oncology.

John Snow Labs offers a powerful NLP & LLM library tailored for healthcare, empowering professionals to extract actionable insights from medical text. Utilizing advanced techniques like NER, assertion status detection, and relation extraction, this library helps uncover vital cancer information for more accurate diagnosis, treatment, and prevention.

Let us start with a short Healthcare NLP introduction and then discuss the applications of John Snow Labs’ Healthcare NLP & LLM library in various oncology settings.

Healthcare NLP & LLM

The Healthcare Library is a powerful component of John Snow Labs’ Healthcare NLP platform, designed to facilitate NLP tasks within the healthcare domain. This library provides over 2,500 pre-trained models and pipelines tailored for medical data, enabling accurate information extraction, NER for clinical and medical concepts, and text analysis capabilities. Regularly updated and built with cutting-edge algorithms, the Healthcare library aims to streamline information processing and empower healthcare professionals with deeper insights from unstructured medical data sources, such as electronic health records, clinical notes, and biomedical literature.

John Snow Labs has created custom large language models (LLMs) tailored for diverse healthcare use cases. These models come in different sizes and quantization levels, designed to handle tasks such as summarizing medical notes, answering questions, performing retrieval-augmented generation (RAG), named entity recognition and facilitating healthcare-related chats.

John Snow Labs’ GitHub repository serves as a collaborative platform where users can access open-source resources, including code samples, tutorials, and projects, to further enhance their understanding and utilization of Healthcare NLP and related tools.

John Snow Labs also offers periodic certification training to help users gain expertise in utilizing the Healthcare Library and other components of their NLP platform.

John Snow Labs’ demo page provides a user-friendly interface for exploring the capabilities of the library, allowing users to interactively test and visualize various functionalities and models, facilitating a deeper understanding of how these tools can be applied to real-world scenarios in healthcare and other domains.

The Oncology Use Cases reference notebook provides a comprehensive guide to using medical Language Models for analyzing oncology-related clinical text. This resource demonstrates the use of cutting-edge NLP techniques for extracting and analyzing cancer-specific information, including identifying cancer types, detecting patient assertion status, extracting relations between entities, and performing text classification tasks.

With practical examples tailored to real-world oncology scenarios, the notebook features more than 60 pre-trained models, including NER, assertion status detection, relation extraction, and text classification models.

Models For Oncology-Related Tasks

Use Case-1: Processing Oncology Notes with Metastasis

The first use case focuses on metastasis, which is the spread of cancer cells from the original tumor site to other parts of the body. Metastasis occurs when cancer cells break away from the primary tumor, enter the bloodstream or lymphatic system, and form new tumors in other parts of the body. This process is what defines cancer as metastatic or Stage IV. Metastasis is the primary cause of cancer-related deaths and significantly reduces the possibilities of curative treatment. The most common sites for metastasis are the lungs, liver, brain, and bones. Understanding metastasis is crucial for cancer diagnosis, treatment, and prognosis.

https://www.cancer.gov/publications/dictionaries/cancer-terms/def/metastasis

We will use MT ONCOLOGY NOTES, which comprises millions of electronic health records (EHR) of patients. It contains semi-structured data like demographics, insurance details, and a lot more, but most importantly, it also contains free-text data like real encounters and notes. Here we show how to use Healthcare NLP’s existing models to process raw text and extract highly specialized cancer information that can be used for various downstream use cases. The reports are similar to the Oncology report below.

This use case includes a detailed workflow for analyzing oncology-related clinical and medical files to extract critical information specific to metastatic cases. The process begins with an efficient filtering stage using the DocumentFiltererByNER annotator, which scans all available documents and selects only those containing entities labeled “Metastasis.” This filtering step significantly streamlines the analysis by focusing on the most relevant documents, reducing noise, and ensuring more targeted downstream processing.

# Filter to get only the texts which include Metastasis cases
filterer = DocumentFiltererByNER() \
    .setInputCols(["sentence", "cancer_chunk"]) \
    .setOutputCol("filterer") \
    .setWhiteList(["Metastasis"])
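Conceptually, a whitelist filter of this kind keeps a document only when at least one of its NER chunks carries a whitelisted label. The plain-Python sketch below illustrates that idea on toy data. It is not the annotator's implementation, and the `docs` layout, the `keep_documents` helper, and the sample chunks are assumptions made up for illustration.

```python
# Plain-Python illustration of whitelist filtering by NER label.
# The data layout and helper name are hypothetical, not part of Healthcare NLP.

def keep_documents(docs, white_list):
    """Return ids of documents with at least one chunk whose label is whitelisted."""
    allowed = set(white_list)
    return [
        doc_id
        for doc_id, chunks in docs.items()
        if any(label in allowed for _text, label in chunks)
    ]

# Toy input: document id -> list of (chunk text, NER label) pairs.
docs = {
    1: [("breast cancer", "Cancer_Dx")],
    4: [("adenocarcinoma", "Cancer_Dx"), ("metastatic", "Metastasis")],
    7: [("metastases", "Metastasis"), ("liver", "Anatomical_Site")],
}

metastasis_docs = keep_documents(docs, ["Metastasis"])
print(metastasis_docs)
```

Filtering before running the heavier models means that assertion, relation extraction, and code resolution only see the documents that matter for the use case.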

As the dataframe below shows (a small part of the full result), only patient files 4, 7 and 8 include metastasis.

Once the metastasis-related files are identified, advanced medical models are applied to extract and analyze the clinical information. The notebook utilizes pre-trained models for NER to identify key entities such as cancer types, anatomical locations, and metastasis mentions.

Healthcare NLP Display is an open-source Python library for visualizing the generated results. Being able to quickly visualize the entities, relations, and assertion statuses produced by Healthcare NLP speeds up development and makes the results easier to interpret.

The NerVisualizer highlights the named entities that are identified by the NER model and also displays their labels as decorations on top of the analyzed text.

Assertion Status Detection models consider the clinical context to determine the assertion status of entities extracted by the NER model. In this use case, we kept only the entities whose status is Present.

assertion_filterer = AssertionFilterer()\
    .setInputCols("sentence","assertion_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])

The resulting dataframe and the visuals are below:

The AssertionVisualizer displays the assertion model’s labels on top of the named entities predicted by the NER model.

Relation Extraction models identify connections between entities. In this case, the relationship between cancer types or metastasis sites and primary cancer locations was established by defining the entities of interest as shown in the code snippet.

oncology_location_re = RelationExtractionModel.pretrained("re_oncology_location", "en", "clinical/models") \
    .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \
    .setOutputCol("relation_extraction") \
    .setRelationPairs(['Anatomical_Site-Adenopathy',
                       'Anatomical_Site-Cancer_Dx',
                       'Anatomical_Site-Histological_Type',
                       'Anatomical_Site-Metastasis',
                       'Anatomical_Site-Tumor_Finding',
                       'Anatomical_Site-Oncological',
                       'Anatomical_Site-CNS_Tumor_Type',
                       'Anatomical_Site-Carcinoma_Type'
                       ]) \
    .setMaxSyntacticDistance(10)
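To make the role of the relation pairs concrete, the sketch below shows in plain Python how a list of "Label1-Label2" strings can restrict which chunk pairs are even considered as relation candidates. This is an illustration under stated assumptions, not the model's internals: the `candidate_pairs` helper and the toy chunks are hypothetical, pairs are treated as unordered, and the syntactic distance limit is not modeled.

```python
from itertools import combinations

# Illustrative only: restrict candidate chunk pairs to the allowed label pairs.
# Assumption: a pair string such as "Anatomical_Site-Metastasis" matches in either order.

def candidate_pairs(chunks, relation_pairs):
    """Return chunk pairs whose two labels form one of the allowed label pairs."""
    allowed = {frozenset(pair.split("-")) for pair in relation_pairs}
    return [
        (first, second)
        for first, second in combinations(chunks, 2)
        if frozenset((first[1], second[1])) in allowed
    ]

# Toy chunks as (text, NER label) tuples.
chunks = [
    ("metastases", "Metastasis"),
    ("liver", "Anatomical_Site"),
    ("carcinoma", "Cancer_Dx"),
]
relation_pairs = ["Anatomical_Site-Metastasis", "Anatomical_Site-Cancer_Dx"]

pairs = candidate_pairs(chunks, relation_pairs)
for first, second in pairs:
    print(first[0], "<->", second[0])
```

Narrowing the candidates this way keeps the model from scoring pairs that are not meaningful for the task, such as a metastasis mention paired directly with a diagnosis.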

The resulting dataframe and the visuals are below:

The RelationExtractionVisualizer can be used to visualize the relations predicted by the models.

The Healthcare NLP library also makes it possible to map the entities extracted by the NER model to ICD-10 codes (models for SNOMED, ICD-O, RxNorm and many more coding systems are available in the library).

icd10cm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")
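The idea behind an embedding-based resolver with a Euclidean distance function can be sketched as a nearest-neighbour lookup: the chunk is embedded, and the codes whose reference embeddings lie closest are returned as candidates. The example below is a toy illustration only. The three-dimensional vectors, the `resolve` helper, and the tiny code index are invented for the sketch; real sentence embeddings have hundreds of dimensions and the library's lookup is more involved.

```python
import math

# Hypothetical 3-dimensional "embeddings" for a few ICD-10-CM codes.
code_index = {
    "C50.919": [0.9, 0.1, 0.0],  # malignant neoplasm of unspecified site of unspecified female breast
    "C78.7": [0.1, 0.9, 0.2],    # secondary malignant neoplasm of liver and intrahepatic bile duct
    "C79.51": [0.0, 0.2, 0.9],   # secondary malignant neoplasm of bone
}

def resolve(chunk_vector, index, top_k=2):
    """Rank codes by Euclidean distance to the chunk vector and return the closest ones."""
    ranked = sorted(index.items(), key=lambda item: math.dist(chunk_vector, item[1]))
    return [(code, math.dist(chunk_vector, vector)) for code, vector in ranked[:top_k]]

# Made-up vector standing in for the embedding of a chunk such as "liver metastasis".
chunk_vector = [0.2, 0.8, 0.1]

best = resolve(chunk_vector, code_index)
for code, distance in best:
    print(code, round(distance, 3))
```

Returning several ranked candidates instead of a single code leaves room for a reviewer or a downstream rule to pick the most appropriate one.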

Please check the notebook for a more detailed solution, but the result dataframe for the disease related entities is shown below:

Here is the result dataframe for the body part entities:

In summary, this workflow demonstrates the practical application of NLP in oncology research, providing actionable insights for analyzing metastatic cases. By integrating efficient filtering with robust medical models, this approach ensures a focused and accurate analysis of clinical texts, tailored to the needs of oncology practitioners and researchers.

Use Case-2: Biomarker and Biomarker Result Table Generation from Oncology Notes

The second use case presents a streamlined method to extract and organize information about biomarkers and their results from the sections of oncology notes dedicated to biomarker analysis. This data is essential for conducting robust data analysis and advancing biomarker research. However, manually extracting this information from lengthy oncology notes is a time-consuming, labor-intensive process prone to human error.

The approach for this task is shown below:

The first stage uses a text classification model (bert_sequence_classifier_biomarker) to determine whether each clinical sentence includes terms related to biomarkers. The next stage uses these predictions to keep only the relevant sentences.

# Determine whether the clinical sentences include terms related to biomarkers or not
sequenceClassifier = BertForSequenceClassification \
    .pretrained("bert_sequence_classifier_biomarker", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("prediction")

# Filter to get only the sentences which include Biomarkers.

document_filterer = medical.DocumentFiltererByClassifier()\
    .setInputCols(["sentence", "prediction"])\
    .setOutputCol("filtered_documents")\
    .setWhiteList(["1"])

The “classes” column in the dataframe below indicates whether each sentence contains biomarker-related information, categorizing them as either biomarker-relevant or non-biomarker sentences.

In the second stage, the NER model identifies and extracts biomarker and biomarker result entities from clinical text.

Relation Extraction is the final stage, which establishes connections between biomarkers and their corresponding results. This structured approach enables comprehensive analysis of biomarker data by progressively moving from basic classification to relationship identification, ensuring thorough and accurate results.
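Once biomarkers and results are linked, turning the relations into a table is mostly a matter of normalizing the order of the two linked chunks. The sketch below shows one way to do that in plain Python. The record layout (`chunk1`, `entity1`, `chunk2`, `entity2`), the `to_table` helper, and the sample relations are assumptions for illustration and may not match the exact output columns of the pipeline.

```python
# Illustrative post-processing: build (biomarker, result) rows from relation records.
# The record layout and values below are hypothetical.

def to_table(relations):
    """Return (biomarker, result) rows regardless of which chunk came first in the text."""
    rows = []
    for rel in relations:
        by_label = {rel["entity1"]: rel["chunk1"], rel["entity2"]: rel["chunk2"]}
        if set(by_label) >= {"Biomarker", "Biomarker_Result"}:
            rows.append((by_label["Biomarker"], by_label["Biomarker_Result"]))
    return rows

relations = [
    {"chunk1": "ER", "entity1": "Biomarker", "chunk2": "positive", "entity2": "Biomarker_Result"},
    {"chunk1": "negative", "entity1": "Biomarker_Result", "chunk2": "HER2", "entity2": "Biomarker"},
    {"chunk1": "Ki-67", "entity1": "Biomarker", "chunk2": "20%", "entity2": "Biomarker_Result"},
]

table = to_table(relations)
for biomarker, result in table:
    print(f"{biomarker:<10}{result}")
```

Keying each record by entity label handles sentences where the result precedes the biomarker, such as "negative for HER2".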

Using LLMs: QA & Summarization

In this section, we explore the use of large language models (LLMs) and question-answering (QA) techniques to extract valuable insights from clinical notes. By posing targeted questions, we generate concise summaries and retrieve meaningful answers, enabling a deeper understanding of the data and streamlining analysis. The jsl_medm_q8_v1 model is used to produce the results.

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

medical_llm = MedicalLLM.pretrained("jsl_medm_q8_v1", "en", "clinical/models")\
    .setInputCols("document")\
    .setOutputCol("completions")\
    .setBatchSize(1)\
    .setNPredict(100)\
    .setUseChatTemplate(True)\
    .setTemperature(0)

llm_pipeline = Pipeline(
    stages = [
        document_assembler,
        medical_llm
])

The first case asks a question about the stage of the cancer mentioned in the text and shows the model’s answer.

The second case prompts the model for a summary of the clinical report.
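The two cases differ only in the prompt that is placed in the `text` column read by the document assembler. A minimal sketch of how such prompts might be composed is shown below. The `build_prompt` helper, the prompt wording, and the sample note are all hypothetical and are not taken from the reference notebook.

```python
# Hypothetical prompt construction for the QA and summarization cases.
# The note is an invented example, not real patient data.

def build_prompt(note, instruction):
    """Combine an instruction with a clinical note into a single prompt string."""
    return f"{instruction.strip()}\n\nClinical note:\n{note.strip()}"

note = """
A 62-year-old woman with stage IIIA invasive ductal carcinoma of the left breast
completed four cycles of neoadjuvant chemotherapy before surgery.
"""

instructions = [
    "What is the stage of the cancer described in the note? Answer briefly.",
    "Summarize the clinical note in two sentences.",
]

# One row per prompt, using the "text" key expected by the document assembler above.
rows = [{"text": build_prompt(note, instruction)} for instruction in instructions]
for row in rows:
    print(row["text"])
    print("-" * 40)
```

Rows like these could then be turned into a Spark DataFrame, for example with `spark.createDataFrame(rows)`, and passed through the pipeline defined above.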

Using LLMs for question-answering and summarization enables efficient extraction of key insights from clinical notes, making complex data more accessible and actionable for research and analysis.

Zero Shot Oncology NER Model

Zero Shot NER enables the identification of entities in text with minimal effort. By using pre-trained language models and contextual understanding, Zero Shot NER extends entity recognition capabilities to new domains and languages.

The model is designed to support any set of entity labels, allowing users to adapt it to their specific use cases. For best results, it is recommended to use labels that are conceptually similar to the provided defaults.

labels = ["Adenopathy", "Age","Biomarker","Biomarker_Result","Body_Part","Cancer_Dx","Cancer_Surgery",
        "Cycle_Count","Cycle_Day","Date","Death_Entity","Direction","Dosage","Duration","Frequency",
        "Gender","Grade","Histological_Type","Imaging_Test","Invasion","Metastasis","Oncogene","Pathology_Test",
        "Race_Ethnicity","Radiation_Dose","Relative_Date","Response_To_Treatment","Route","Smoking_Status",
        "Staging","Therapy","Tumor_Finding","Tumor_Size"]

pretrained_zero_shot_ner = medical.PretrainedZeroShotNER().pretrained("zeroshot_ner_oncology_medium", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("ner")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)
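The prediction threshold trades recall for precision: a higher value keeps only the entities the model is most confident about. The toy sketch below illustrates that effect in plain Python. The prediction records, the `filter_predictions` helper, and the choice of an inclusive comparison are assumptions for illustration; the library applies its threshold internally.

```python
# Toy post-filter showing the effect of a confidence threshold on zero-shot NER output.
# Records and helper are hypothetical, not the library's implementation.

def filter_predictions(predictions, labels, threshold=0.5):
    """Keep predictions whose label was requested and whose score meets the threshold."""
    requested = set(labels)
    return [
        p for p in predictions
        if p["label"] in requested and p["confidence"] >= threshold
    ]

predictions = [
    {"chunk": "lung", "label": "Body_Part", "confidence": 0.93},
    {"chunk": "metastatic", "label": "Metastasis", "confidence": 0.71},
    {"chunk": "mass", "label": "Tumor_Finding", "confidence": 0.42},
]
requested_labels = ["Body_Part", "Metastasis", "Tumor_Finding"]

default_cut = filter_predictions(predictions, requested_labels, threshold=0.5)
strict_cut = filter_predictions(predictions, requested_labels, threshold=0.8)
print([p["chunk"] for p in default_cut])
print([p["chunk"] for p in strict_cut])
```

When adding new or rare entity types, it can help to start with a lower threshold, inspect the extra candidates, and then tighten it.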

The entities extracted (according to the labels defined by the user) by the zero shot model are:

Zero Shot NER models offer a flexible and efficient solution for identifying new entities without the need for extensive training data. This approach streamlines entity extraction, making it ideal for adapting to evolving research needs with minimal effort.

Conclusion

This blog post demonstrates how John Snow Labs’ Healthcare NLP and LLM library is reshaping oncology data analysis by automating the extraction, organization, and interpretation of critical clinical information. Through the use of advanced medical models, researchers and clinicians can efficiently extract key entities such as biomarkers, metastasis mentions, and cancer types using NER, assertion status detection, and relation extraction models. Additionally, the integration of ICD-10 mapping ensures precise coding and classification of oncology cases for deeper insights.

The second section provides a systematic approach to biomarker analysis by combining text classification, entity extraction, and relation extraction. Together, these stages enable efficient processing, accurate identification, and meaningful connections between biomarkers and their results, ensuring comprehensive and reliable insights.

LLMs play a transformative role in oncology analysis, enabling the summarization of lengthy clinical notes and facilitating interactive question-answering to extract valuable insights on demand. Furthermore, the introduction of zero-shot NER capabilities empowers users to identify and classify new entities with minimal effort, making the library adaptable to emerging research needs.

By streamlining oncology data processing, enhancing decision-making, and reducing the time and effort required for manual data extraction, these tools play a critical role in advancing oncology research and improving patient outcomes. Whether it’s uncovering actionable insights or facilitating innovative research, the Healthcare NLP and LLM library offers a robust solution to meet the challenges of modern oncology.

Healthcare NLP models are licensed. To use them, watch the “Get a Free License For John Snow Labs NLP Libraries” video and request a license from John Snow Labs.

Gursev Pirge

Researcher and Data Scientist

Our additional expert:

A Researcher and Data Scientist with demonstrated success delivering innovative policies and machine learning algorithms, having strong statistical skills, and presenting to all levels of leadership to improve decision making. Experience in Education, Logistics, Data Analysis and Data Science. Strong education professional with a Doctor of Philosophy (Ph.D.) focused on Mechanical Engineering from Boğaziçi University.
