- TFGraphBuilder annotator to create graphs for training NER, Assertion, Relation Extraction, and Generic Classifier models
- Default TF graphs added for AssertionDLApproach to let users train models without custom graphs
- New functionalities in ContextualParserApproach
- Printing the list of clinical pretrained models and pipelines with a one-liner
- New clinical models:
  - Clinical NER model (ner_biomedical_bc2gm)
  - Clinical ChunkMapper models (abbreviation_mapper, rxnorm_ndc_mapper, drug_brandname_ndc_mapper, rxnorm_action_treatment_mapper)
TFGraphBuilder annotator to create graphs for training NER, Assertion, Relation Extraction, and Generic Classifier models
We have a new annotator used to create graphs in the model training pipeline. TFGraphBuilder inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.
You can use this builder with MedicalNerApproach, RelationExtractionApproach, AssertionDLApproach, and GenericClassifierApproach.
Example:
graph_folder_path = "./medical_graphs"

med_ner_graph_builder = TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder_path)

med_ner = MedicalNerApproach()\
    ...
    .setGraphFolder(graph_folder_path)

medner_pipeline = Pipeline(stages=[
    ...,
    med_ner_graph_builder,
    med_ner
])
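The same builder can also feed the other trainable annotators listed above. As an additional illustration, here is a minimal sketch of pairing it with AssertionDLApproach; the "assertion_dl" model name and the column names are assumptions made for this sketch, not taken from these release notes.

# Hypothetical sketch: reusing TFGraphBuilder for an assertion graph.
# The "assertion_dl" model name and the column names below are assumptions.
assertion_graph_builder = TFGraphBuilder()\
    .setModelName("assertion_dl")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setGraphFolder(graph_folder_path)

assertion_approach = AssertionDLApproach()\
    ...
    .setGraphFolder(graph_folder_path)

assertion_pipeline = Pipeline(stages=[
    ...,
    assertion_graph_builder,
    assertion_approach
])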
For more examples, please check the TFGraph Builder Notebook.
Default TF graphs added for AssertionDLApproach to let users train models without custom graphs
We added default TF graphs for the AssertionDLApproach to let users train assertion models without specifying any custom TF graph.
Default Graph Features:
- Feature Sizes: 100, 200, 768
- Number of Classes: 2, 4, 8
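For illustration, a minimal training sketch is shown below. Because no setGraphFolder is set, the approach falls back to one of the bundled default graphs, so the embedding size and the number of label classes must match one of the combinations listed above. The column names and training data are placeholders chosen for this sketch.

# Hypothetical sketch: training an assertion model with a bundled default TF graph.
# No .setGraphFolder() is set, so a default graph matching the feature size and
# number of classes is used. Column names and data are placeholders.
assertion_approach = AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols(["document", "chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setStartCol("start")\
    .setEndCol("end")\
    .setEpochs(10)

assertion_model = Pipeline(stages=[..., assertion_approach]).fit(assertion_training_data)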
New Functionalities in ContextualParserApproach
- Added the setOptionalContextRules parameter, which allows outputting regex matches regardless of a context match (prefix, suffix configuration).
- The setJsonPath parameter now also accepts a JSON string of the configuration file, in addition to a file path.
Confidence Value Scenarios:
- When there is only a regex match, the confidence value will be 0.5.
- When there are regex and prefix matches together, the confidence value will be > 0.5, depending on the distance between the target token and the prefix.
- When there are regex and suffix matches together, the confidence value will be > 0.5, depending on the distance between the target token and the suffix.
- When there are regex, prefix, and suffix matches all together, the confidence value will be higher than in the other scenarios.
Example:
import json

jsonString = {
    "entity": "CarId",
    "ruleScope": "sentence",
    "completeMatchRegex": "false",
    "regex": "\\d+",
    "prefix": ["red"],
    "contextLength": 100
}

with open("jsonString.json", "w") as f:
    json.dump(jsonString, f)

contextual_parser = ContextualParserApproach()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("entity")\
    .setJsonPath("jsonString.json")\
    .setCaseSensitive(True)\
    .setOptionalContextRules(True)
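As mentioned above, the configuration can also be sent as a JSON string instead of a file path. A minimal sketch, assuming setJsonPath accepts the raw JSON content directly:

# Hypothetical sketch: passing the configuration as a JSON string rather than a path.
# Assumes setJsonPath accepts the raw JSON content, as described above.
contextual_parser_from_string = ContextualParserApproach()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("entity")\
    .setJsonPath(json.dumps(jsonString))\
    .setCaseSensitive(True)\
    .setOptionalContextRules(True)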
Printing the List of Clinical Pretrained Models and Pipelines with a One-Liner
Now we can check the names of the clinical pretrained models for a specific annotator, as well as the names of the clinical pretrained pipelines in a given language.
Clinical Pipeline Names:
Example:
from sparknlp_jsl.pretrained import InternalResourceDownloader

InternalResourceDownloader.showPrivatePipelines("en")
+--------------------------------------------------------+------+---------+
| Pipeline                                                | lang | version |
+--------------------------------------------------------+------+---------+
| clinical_analysis                                       | en   | 2.4.0   |
| clinical_ner_assertion                                  | en   | 2.4.0   |
| clinical_deidentification                               | en   | 2.4.0   |
| clinical_analysis                                       | en   | 2.4.0   |
| explain_clinical_doc_ade                                | en   | 2.7.3   |
| icd10cm_snomed_mapping                                  | en   | 2.7.5   |
| recognize_entities_posology                             | en   | 3.0.0   |
| explain_clinical_doc_carp                               | en   | 3.0.0   |
| recognize_entities_posology                             | en   | 3.0.0   |
| explain_clinical_doc_ade                                | en   | 3.0.0   |
| explain_clinical_doc_era                                | en   | 3.0.0   |
| icd10cm_snomed_mapping                                  | en   | 3.0.2   |
| snomed_icd10cm_mapping                                  | en   | 3.0.2   |
| icd10cm_umls_mapping                                    | en   | 3.0.2   |
| snomed_umls_mapping                                     | en   | 3.0.2   |
| …                                                       | …    | …       |
+--------------------------------------------------------+------+---------+
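For the model names of a specific annotator, an analogous one-liner can be used. A minimal sketch, assuming InternalResourceDownloader also exposes a showPrivateModels helper that takes the annotator class name and the language:

# A minimal sketch, assuming a showPrivateModels helper that filters private models
# by annotator class name and language.
from sparknlp_jsl.pretrained import InternalResourceDownloader

InternalResourceDownloader.showPrivateModels("AssertionDLModel", "en")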
New ner_biomedical_bc2gm NER Model
This model has been trained to extract genes/proteins from medical text.
See Model Card for more details.
Example:
...

ner = MedicalNerModel.pretrained("ner_biomedical_bc2gm", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
...

text = spark.createDataFrame([["Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections."]]).toDF("text")

result = model.transform(text)
+-----------+------------+
|chunk      |ner_label   |
+-----------+------------+
|S-100      |GENE_PROTEIN|
|HMB-45     |GENE_PROTEIN|
|cytokeratin|GENE_PROTEIN|
+-----------+------------+
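For reference, a complete prediction pipeline around this model might look like the sketch below. The upstream stages (sentence detector, tokenizer, clinical embeddings) and the choice of embeddings_clinical are assumptions made to produce the input columns this NER model expects, not copied from these release notes.

# Hypothetical end-to-end sketch around ner_biomedical_bc2gm.
# Upstream stages and the embeddings choice are assumptions for illustration.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_biomedical_bc2gm", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner,
    ner_converter
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = model.transform(text)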
New Clinical ChunkMapper Models
We have 4 new ChunkMapper models and a new Chunk Mapping Notebook that shows how to use them.
drug_brandname_ndc_mapper: This model maps drug brand names to corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in results and metadata.
See Model Card for more details.
Example:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("chunk")

chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
    .setInputCols(["chunk"])\
    .setOutputCol("ndc")\
    .setRel("Strength_NDC")

model = PipelineModel(stages=[document_assembler, chunkerMapper])

light_model = LightPipeline(model)

res = light_model.fullAnnotate(["zytiga", "ZYVOX", "ZYTIGA"])
+-------------+--------------------------+------------------------------------------------------------+
| Brandname   | Strength_NDC             | Other_NDCs                                                 |
+-------------+--------------------------+------------------------------------------------------------+
| zytiga      | 500 mg/1 | 57894-195     | ['250 mg/1 | 57894-150']                                   |
| ZYVOX       | 600 mg/300mL | 0009-4992 | ['600 mg/300mL | 66298-7807', '600 mg/300mL | 0009-7807'] |
| ZYTIGA      | 500 mg/1 | 57894-195     | ['250 mg/1 | 57894-150']                                   |
+-------------+--------------------------+------------------------------------------------------------+
abbreviation_mapper: This model maps abbreviations and acronyms of medical regulatory activities to their definitions.
See Model Card for details.
Example:
input = ["""Gravid with estimated fetal weight of 6-6/12 pounds. LABORATORY DATA: Laboratory tests include a CBC which is normal. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]

>> output:
+------------+----------------------------+
|Abbreviation|Definition                  |
+------------+----------------------------+
|CBC         |complete blood count        |
|HIV         |human immunodeficiency virus|
+------------+----------------------------+
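The mapper stage itself can be wired up in the same way as the drug_brandname_ndc_mapper example above. The sketch below is illustrative only: the relation name "definition" and feeding the abbreviations in directly as short texts are assumptions, not taken from these release notes.

# Hypothetical sketch, mirroring the drug_brandname_ndc_mapper example above.
# The relation name "definition" and the direct chunk input are assumptions.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("chunk")

abbreviation_mapper = ChunkMapperModel.pretrained("abbreviation_mapper", "en", "clinical/models")\
    .setInputCols(["chunk"])\
    .setOutputCol("mappings")\
    .setRel("definition")

mapper_model = PipelineModel(stages=[document_assembler, abbreviation_mapper])

light_model = LightPipeline(mapper_model)

res = light_model.fullAnnotate(["CBC", "HIV"])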
rxnorm_action_treatment_mapper: This model maps RxNorm and RxNorm Extension codes to their corresponding actions and treatments. Action refers to the function of the drug in various body systems; treatment refers to the disease the drug is used to treat.
See Model Card for details.
Example:
input = ['Sinequan 150 MG', 'Zonalon 50 mg']

>> output:
+---------------+------------+---------------+
|chunk          |rxnorm_code |Action         |
+---------------+------------+---------------+
|Sinequan 150 MG|1000067     |Antidepressant |
|Zonalon 50 mg  |103971      |Analgesic      |
+---------------+------------+---------------+
rxnorm_ndc_mapper: This pretrained model maps RxNorm and RxNorm Extension codes to corresponding National Drug Codes (NDC).
See Model Card for details.
Example:
input = ['doxepin hydrochloride 50 MG/ML', 'macadamia nut 100 MG/ML']

>> output:
+------------------------------+------------+------------+
|chunk                         |rxnorm_code |Product NDC |
+------------------------------+------------+------------+
|doxepin hydrochloride 50 MG/ML|1000091     |00378-8117  |
|macadamia nut 100 MG/ML       |212433      |00064-2120  |
+------------------------------+------------+------------+
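Both RxNorm mappers can be plugged in the same way as the other ChunkMapper models. The sketch below is an illustration only: the relation names ("action" and "Product NDC") and the expectation that the incoming chunks already carry RxNorm codes (for example, produced by an upstream RxNorm resolver) are assumptions, not taken from these release notes.

# Hypothetical sketch for the two RxNorm mappers.
# Assumptions: the input chunks already hold RxNorm codes (e.g. produced by an
# upstream RxNorm resolver), and the relation names match the outputs shown above.
action_mapper = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["rxnorm_chunk"])\
    .setOutputCol("action_mappings")\
    .setRel("action")

ndc_mapper = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\
    .setInputCols(["rxnorm_chunk"])\
    .setOutputCol("ndc_mappings")\
    .setRel("Product NDC")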