The release of Spark NLP for Healthcare 3.1 brings significant speed and accuracy improvements for the task of entity resolution, also known as entity linking: the ability to map a medical entity to a standard code. This release supports the SNOMED-CT, ICD-10-CM, ICD-10-PCS, CPT, LOINC, RxNorm, UMLS, HPO, and ICD-O terminologies. Several customers have already vetted the accuracy gains on real-world data.
New John Snow Labs SBert Sentence Embeddings
One challenge with resolving medical entities to codes is that very often, multiple similar codes exist for a term, and the most appropriate one depends on context. For example, “bladder cancer” may be mapped to any of the following ICD-10-CM standard terms:
- Cancer in situ of urinary bladder
- Carcinoma in situ of bladder
- Tumor of bladder neck
- Neoplasm of unspecified behavior of bladder
- Malignant tumour of bladder neck
- Secondary malignant neoplasm of bladder
- Malignant tumor of urinary bladder
- Malignant neoplasm of bladder, unspecified
Ranking these options, or picking the most relevant one, depends heavily on the context. To provide a better understanding of medical context, we’ve developed a set of new healthcare-specific sentence embeddings, including the first medical sentence embeddings available in different sizes. These embeddings (and the entity resolution models that leverage them) are not available anywhere else, and they deliver better accuracy than now-outdated embeddings such as BioBERT and ClinicalBERT, for three reasons:
- They’re based on a newer deep learning architecture (SBERT).
- We augmented the training data beyond what’s available in public academic datasets.
- They’re more current, because we just retrained them. In contrast, for example, BioBERT was trained in 2019 – before any mention of COVID-19 on PubMed.
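To make the ranking idea concrete, here is a minimal sketch of ordering candidate terms by cosine distance to a query embedding. It is an illustration under stated assumptions, not the library's implementation: the three-dimensional vectors are invented for this example, the helper names `cosine_distance` and `rank_candidates` are hypothetical, and the resulting order carries no clinical meaning.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidates):
    """Sort (term, vector) pairs by ascending distance to the query."""
    scored = [(term, cosine_distance(query_vec, vec)) for term, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical 3-dimensional vectors, invented for illustration only.
# Real sentence embeddings have hundreds of dimensions.
query = [0.9, 0.1, 0.3]
candidates = [
    ("Carcinoma in situ of bladder", [0.7, 0.6, 0.1]),
    ("Malignant neoplasm of bladder, unspecified", [0.9, 0.2, 0.3]),
    ("Secondary malignant neoplasm of bladder", [0.2, 0.9, 0.4]),
]

ranked = rank_candidates(query, candidates)
for term, dist in ranked:
    print(f"{dist:.4f}  {term}")
```

In a real pipeline the vectors would come from one of the sentence embedding models, so the quality of the ranking depends on how well those embeddings capture medical context.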
The new SBERT models delivered with the 3.1 release are fine-tuned on the MedNLI, NLI, and UMLS datasets with various parameters to cover common NLP tasks in the medical domain:
- sbiobert_jsl_cased
- sbiobert_jsl_umls_cased
- sbert_jsl_medium_uncased
- sbert_jsl_medium_umls_uncased
- sbert_jsl_mini_uncased
- sbert_jsl_mini_umls_uncased
- sbert_jsl_tiny_uncased
- sbert_jsl_tiny_umls_uncased
6X Faster Load Times for Sentence Resolver Models
Sentence resolver models now load faster, with an average six-fold speedup compared to previous versions. The load process is also more memory-friendly: peak memory use during loading is lower, which reduces the chance of out-of-memory exceptions and relaxes hardware requirements.
John Snow Labs SBert Model Speed Benchmark
Model | Base Model | Is Cased | Train Datasets | Inference speed (100 rows) |
---|---|---|---|---|
sbiobert_jsl_cased | biobert_v1.1_pubmed | Cased | medNLI, allNLI | 274.53 |
sbiobert_jsl_umls_cased | biobert_v1.1_pubmed | Cased | medNLI, allNLI, umls | 274.52 |
sbert_jsl_medium_uncased | uncased_L-8_H-512_A-8 | Uncased | medNLI, allNLI | 80.40 |
sbert_jsl_medium_umls_uncased | uncased_L-8_H-512_A-8 | Uncased | medNLI, allNLI, umls | 78.35 |
sbert_jsl_mini_uncased | uncased_L-4_H-256_A-4 | Uncased | medNLI, allNLI | 10.68 |
sbert_jsl_mini_umls_uncased | uncased_L-4_H-256_A-4 | Uncased | medNLI, allNLI, umls | 10.29 |
sbert_jsl_tiny_uncased | uncased_L-2_H-128_A-2 | Uncased | medNLI, allNLI | 4.54 |
sbert_jsl_tiny_umls_uncased | uncased_L-2_H-128_A-2 | Uncased | medNLI, allNLI, umls | 4.54 |
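The benchmark figures are easier to compare as ratios. The sketch below computes each model's speed relative to sbiobert_jsl_cased. It assumes the "Inference speed (100 rows)" figures are elapsed times, so that lower means faster; the table does not state a unit, and the helper name `relative_speed` is hypothetical.

```python
# Figures copied from the "Inference speed (100 rows)" column above.
# Assumption: they are elapsed times, so a lower value means a faster model.
timings = {
    "sbiobert_jsl_cased": 274.53,
    "sbiobert_jsl_umls_cased": 274.52,
    "sbert_jsl_medium_uncased": 80.40,
    "sbert_jsl_medium_umls_uncased": 78.35,
    "sbert_jsl_mini_uncased": 10.68,
    "sbert_jsl_mini_umls_uncased": 10.29,
    "sbert_jsl_tiny_uncased": 4.54,
    "sbert_jsl_tiny_umls_uncased": 4.54,
}

def relative_speed(timings, baseline):
    """Return how many times faster each model is than the baseline timing."""
    return {name: baseline / value for name, value in timings.items()}

baseline = timings["sbiobert_jsl_cased"]
speedups = relative_speed(timings, baseline)

# Print from slowest to fastest relative to the baseline.
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name:32s} {ratio:6.1f}x")
```

Under that assumption, the medium models come out roughly 3.4x faster than the BioBERT-based ones, the mini models roughly 26x, and the tiny models roughly 60x.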
Higher Accuracy ICD-10-CM Resolver Models
These models map clinical entities and concepts to ICD-10-CM codes using SBERT sentence embeddings. They also return the official resolution text, in brackets, in the metadata. Both models are augmented with synonyms, and the earlier augmentations were adjusted according to their cosine distances to the unnormalized terms (ground truths).
- sbiobertresolve_icd10cm_slim_billable_hcc: Trained with the classic sbiobert MLI embeddings (sbiobert_base_cased_mli).
Models Hub Page:
https://nlp.johnsnowlabs.com/2021/05/25/sbiobertresolve_icd10cm_slim_billable_hcc_en.html
- sbertresolve_icd10cm_slim_billable_hcc_med: Trained with the new JSL SBERT embeddings (sbert_jsl_medium_uncased).
Models Hub Page:
https://nlp.johnsnowlabs.com/2021/05/25/sbertresolve_icd10cm_slim_billable_hcc_med_en.html
Example: ‘bladder cancer’
sbiobertresolve_icd10cm_augmented_billable_hcc
chunks | code | all_codes | resolutions | all_distances | 100x Loop(sec) |
---|---|---|---|---|---|
bladder cancer | C679 | [C679, Z126, D090, D494, C7911] | [bladder cancer, suspected bladder cancer, cancer in situ of urinary bladder, tumor of bladder neck, malignant tumour of bladder neck] | [0.0000, 0.0904, 0.0978, 0.1080, 0.1281] | 26.9 |
sbiobertresolve_icd10cm_slim_billable_hcc
chunks | code | all_codes | resolutions | all_distances | 100x Loop(sec) |
---|---|---|---|---|---|
bladder cancer | D090 | [D090, D494, C7911, C680, C679] | [cancer in situ of urinary bladder [Carcinoma in situ of bladder], tumor of bladder neck [Neoplasm of unspecified behavior of bladder], malignant tumour of bladder neck [Secondary malignant neoplasm of bladder], carcinoma of urethra [Malignant neoplasm of urethra], malignant tumor of urinary bladder [Malignant neoplasm of bladder, unspecified]] | [0.0978, 0.1080, 0.1281, 0.1314, 0.1284] | 20.9 |
sbertresolve_icd10cm_slim_billable_hcc_med
chunks | code | all_codes | resolutions | all_distances | 100x Loop(sec) |
---|---|---|---|---|---|
bladder cancer | C671 | [C671, C679, C61, C672, C673] | [bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], prostate cancer [Malignant neoplasm of prostate], cancer of the urinary bladder] | [0.0894, 0.1051, 0.1184, 0.1180, 0.1200] | 12.8 |
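Downstream code usually needs each candidate code paired with its distance and with the official text in brackets. The sketch below does that for the sbiobertresolve_icd10cm_slim_billable_hcc example above. The values are copied from that row and entered by hand as Python lists, since a naive split on commas would break on "Malignant neoplasm of bladder, unspecified". The helper `split_resolution` and the dictionary layout are hypothetical, not part of the library's output format.

```python
import re

# Values copied from the sbiobertresolve_icd10cm_slim_billable_hcc row above.
all_codes = ["D090", "D494", "C7911", "C680", "C679"]
resolutions = [
    "cancer in situ of urinary bladder [Carcinoma in situ of bladder]",
    "tumor of bladder neck [Neoplasm of unspecified behavior of bladder]",
    "malignant tumour of bladder neck [Secondary malignant neoplasm of bladder]",
    "carcinoma of urethra [Malignant neoplasm of urethra]",
    "malignant tumor of urinary bladder [Malignant neoplasm of bladder, unspecified]",
]
all_distances = [0.0978, 0.1080, 0.1281, 0.1314, 0.1284]

# Matches "matched term [Official text]" and captures both parts.
BRACKETED = re.compile(r"^(?P<term>.*?)\s*\[(?P<official>.*)\]\s*$")

def split_resolution(text):
    """Split a resolution string into (matched term, official text).

    Returns (term, None) when the string has no bracketed part.
    """
    match = BRACKETED.match(text)
    if match is None:
        return text.strip(), None
    return match.group("term"), match.group("official")

candidates = []
for code, text, dist in zip(all_codes, resolutions, all_distances):
    term, official = split_resolution(text)
    candidates.append(
        {"code": code, "term": term, "official": official, "distance": dist}
    )

# Select by smallest distance rather than by list position.
best = min(candidates, key=lambda c: c["distance"])
print(best["code"], "->", best["official"])
```

Selecting by minimum distance rather than by position is a deliberate choice here: in the row above, the last two distances (0.1314 and 0.1284) are not in ascending order.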