We are extremely excited to announce the release of NLU 3.1! This is our biggest release so far. It comes with 2600+ new models in 200+ languages, including DistilBERT, RoBERTa, XLM-RoBERTa, and Hugging Face based embeddings from the incredible Spark NLP 3.1.0 release. It also brings new Streamlit visualizations for viewing Word Embeddings in 3-D, 2-D, and 1-D, new healthcare pipelines for healthcare code mappings, and confidence extraction for open source NER models. Additionally, the NLU Namespace has been renamed to the NLU Spellbook, to reflect the magic of the 1-liners it contains!
Streamlit Word Embedding visualization via Manifold and Matrix Decomposition algorithms
function pipe.viz_streamlit_word_embed_manifold
Visualize Word Embeddings in 1-D, 2-D, or 3-D by reducing their dimensionality with any of the 11 supported Manifold and Matrix Decomposition algorithms. Additionally, you can color the lower-dimensional points with a label that has previously been assigned to the text by specifying a list of NLU references in the additional_classifiers_for_coloring parameter.
- Reduces the dimensionality of high-dimensional Word Embeddings to 1-D, 2-D, or 3-D and plots the resulting data in an interactive Plotly plot
- Applicable with any of the 100+ Word Embedding models
- Color points by classifying with any of the 100+ Parts of Speech Classifiers or Document Classifiers
- Generates NUM-DIMENSIONS * NUM-EMBEDDINGS * NUM-DIMENSION-REDUCTION-ALGOS plots
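The plot count in the last bullet is a plain product, which a few lines of Python can make concrete. This is only an illustrative sketch: the helper name `planned_plots` is hypothetical and not part of NLU, and the values are the defaults listed in the parameter table (`target_dimensions` of 1, 2, 3 and the algorithms TSNE and PCA) combined with a single `bert` embedding.

```python
from itertools import product

# Defaults taken from the parameter table; a single embedding is assumed.
target_dimensions = [1, 2, 3]
embeddings = ["bert"]
algos = ["TSNE", "PCA"]


def planned_plots(dims, embeds, algorithms):
    """Hypothetical helper: list every (dimension, embedding, algorithm) combination."""
    return list(product(dims, embeds, algorithms))


plots = planned_plots(target_dimensions, embeddings, algos)
print(len(plots))  # 3 dimensions * 1 embedding * 2 algorithms = 6
```

Adding a second embedding or a third algorithm multiplies the number of plots accordingly, which is worth keeping in mind before selecting all 11 algorithms at once.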
import nlu
texts = ['You can visualize any of the 100+ embeddings', 'with 10+ dimension reduction algorithms', 'and view the results in 3D, 2D, and 1D which can be colored by various classifier labels!']
nlu.load('bert').viz_streamlit_word_embed_manifold(default_texts=texts)
Dimension reduction techniques applied to BERT embeddings to view them in 1-D, 2-D and 3-D
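To make the idea of reducing embeddings to 1-D, 2-D, or 3-D more concrete, here is a minimal pure-Python sketch of PCA, one of the supported Matrix Decomposition algorithms. It is not how NLU implements the visualization; the function names (`pca`, `leading_direction`, `center`), the power-iteration approach, and the 4-dimensional toy vectors are all assumptions made for illustration. Real BERT embeddings have hundreds of dimensions.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the per-column mean so the data is centered at the origin."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    return [[row[j] - means[j] for j in range(dim)] for row in rows]


def leading_direction(rows, iterations=200, seed=0):
    """Approximate the top principal direction by power iteration on X^T X."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in rows[0]]
    start_norm = math.sqrt(dot(v, v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]                # X v
        w = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(len(v))]                          # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            # No variance left in the data: use a zero direction.
            return [0.0] * len(v)
        v = [x / norm for x in w]
    return v


def pca(rows, n_components):
    """Project rows onto their first n_components principal directions."""
    centered = center(rows)
    residual = centered
    components = []
    for k in range(n_components):
        v = leading_direction(residual, seed=k)
        components.append(v)
        # Deflation: remove the variance already explained by v.
        residual = [[x - dot(row, v) * vj for x, vj in zip(row, v)]
                    for row in residual]
    return [[dot(row, v) for v in components] for row in centered]


# Toy 4-dimensional "embeddings" for four tokens (made-up values).
toy_embeddings = [
    [1.0, 2.0, 0.0, 0.5],
    [2.0, 4.1, 0.1, 0.4],
    [3.0, 5.9, -0.1, 0.6],
    [4.0, 8.0, 0.0, 0.5],
]

for dims in (1, 2, 3):
    reduced = pca(toy_embeddings, dims)
    print(dims, [len(point) for point in reduced])
```

Each token ends up as a point with 1, 2, or 3 coordinates, which is the shape of data that can then be handed to a plotting library. The sign of each coordinate axis is arbitrary, as is usual for PCA.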
Function parameters pipe.viz_streamlit_word_embed_manifold
| Argument | Type | Default | Description | 
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to party!", "Angela Merkel likes to party!", 'Peter HATES TO PARTTY!!!! :(') | List of strings to apply classifiers, embeddings, and manifolds to. | 
| text | Optional[str] | 'Billy likes to swim' | Text to predict classes for. | 
| sub_title | Optional[str] | "Apply any of the 11 Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Word Embeddings to 1-D, 2-D and 3-D" | Sub title of the Streamlit app | 
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA', or 'KernelPCA'. | 
| target_dimensions | List[int] | (1,2,3) | Defines the target dimension embeddings will be reduced to | 
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition Algorithms | 
| show_embed_select | bool | True | Show selector for Embedding Selection | 
| show_color_select | bool | True | Show selector for coloring plots | 
| MAX_DISPLAY_NUM | int | 100 | Cap maximum number of Tokens displayed | 
| display_embed_information | bool | True | Show additional embedding information like dimension, nlu_reference, spark_nlp_reference, storage_reference, modelhub link, and more. | 
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not. | 
| num_cols | int | 2 | How many columns to use for the layout in Streamlit when rendering the similarity matrices. | 
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn | 
| additional_classifiers_for_coloring | List[str] | ['pos', 'sentiment.imdb'] | List of additional NLU references to load for generating hue colors | 
| show_model_select | bool | True | Show a model selection dropdown that makes any of the 1000+ models available in 1 click | 
| model_select_position | str | 'side' | Where to render the model selection dropdown, for example in the sidebar ('side') | 
| show_logo | bool | True | Show logo | 
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure. | 
| n_jobs | Optional[int] | 3 | Number of parallel jobs to use for the dimension reduction algorithms | 
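The parameters above can be combined freely. The following sketch shows one way to assemble and sanity-check a set of keyword arguments before handing them to `viz_streamlit_word_embed_manifold`. The helpers `build_viz_kwargs` and `launch_visualization` are hypothetical names introduced here, not part of NLU; only the parameter names and the algorithm list are taken from the table. The check on `target_dimensions` assumes that only 1, 2, and 3 are valid, which the table implies but does not state outright.

```python
# Algorithm names copied from the default_algos_to_apply row of the table.
SUPPORTED_ALGOS = (
    "TSNE", "ISOMAP", "LLE", "Spectral Embedding", "MDS", "PCA",
    "SVD aka LSA", "DictionaryLearning", "FactorAnalysis", "FastICA",
    "KernelPCA",
)


def build_viz_kwargs(texts, algos=("TSNE", "PCA"), target_dimensions=(1, 2, 3),
                     coloring=("pos", "sentiment.imdb")):
    """Hypothetical helper: validate choices and return keyword arguments."""
    unknown = [a for a in algos if a not in SUPPORTED_ALGOS]
    if unknown:
        raise ValueError(f"Unsupported algorithms: {unknown}")
    # Assumption: only 1-D, 2-D, and 3-D targets are meaningful.
    bad_dims = [d for d in target_dimensions if d not in (1, 2, 3)]
    if bad_dims:
        raise ValueError(f"Unsupported target dimensions: {bad_dims}")
    return {
        "default_texts": list(texts),
        "default_algos_to_apply": list(algos),
        "target_dimensions": list(target_dimensions),
        "additional_classifiers_for_coloring": list(coloring),
    }


def launch_visualization(nlu_reference, kwargs):
    """Hypothetical wrapper: requires nlu and streamlit to be installed."""
    import nlu  # imported lazily so the helper above works without NLU
    nlu.load(nlu_reference).viz_streamlit_word_embed_manifold(**kwargs)


viz_kwargs = build_viz_kwargs(
    ["Billy likes to swim", "Angela Merkel likes to party!"],
    algos=["PCA", "MDS"],
    target_dimensions=[2, 3],
)
print(viz_kwargs["default_algos_to_apply"])
```

In a Streamlit script you would then call `launch_visualization('bert', viz_kwargs)` and start the app with `streamlit run your_script.py`. Validating the algorithm names up front is a design choice of this sketch, so that a typo fails before any model is loaded.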
Larger Example showcasing more dimension reduction techniques on a larger corpus:
See the Matrix movie script in 3-D from the perspective of BERT or any other Transformer and Embedding!
Supported Manifold Algorithms
Supported Matrix Decomposition Algorithms
New Healthcare Pipelines
Five new healthcare code mapping pipelines:
- nlu.load('en.resolve.icd10cm.umls'): This pretrained pipeline maps ICD10CM codes to UMLS codes without using any text data. Feed it whitespace-delimited ICD10CM codes and it returns the corresponding UMLS codes as a list. If there is no mapping, the original code is returned.
{'icd10cm': ['M89.50', 'R82.2', 'R09.01'],'umls': ['C4721411', 'C0159076', 'C0004044']}
- nlu.load('en.resolve.mesh.umls'): This pretrained pipeline maps MeSH codes to UMLS codes without using any text data. Feed it whitespace-delimited MeSH codes and it returns the corresponding UMLS codes as a list. If there is no mapping, the original code is returned.
{'mesh': ['C028491', 'D019326', 'C579867'],'umls': ['C0970275', 'C0886627', 'C3696376']}
- nlu.load('en.resolve.rxnorm.umls'): This pretrained pipeline maps RxNorm codes to UMLS codes without using any text data. Feed it whitespace-delimited RxNorm codes and it returns the corresponding UMLS codes as a list. If there is no mapping, the original code is returned.
{'rxnorm': ['1161611', '315677', '343663'],'umls': ['C3215948', 'C0984912', 'C1146501']}
- nlu.load('en.resolve.rxnorm.mesh'): This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. Feed it whitespace-delimited RxNorm codes and it returns the corresponding MeSH codes as a list. If there is no mapping, the original code is returned.
{'rxnorm': ['1191', '6809', '47613'],'mesh': ['D001241', 'D008687', 'D019355']}
- nlu.load('en.resolve.snomed.umls'): This pretrained pipeline maps SNOMED codes to UMLS codes without using any text data. Feed it whitespace-delimited SNOMED codes and it returns the corresponding UMLS codes as a list. If there is no mapping, the original code is returned.
{'snomed': ['733187009', '449433008', '51264003'],'umls': ['C4546029', 'C3164619', 'C0271267']}
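The input and output shapes described above can be handled with two small helpers. This is a sketch under stated assumptions: `to_pipeline_input` and `pair_codes` are hypothetical names, not NLU functions, and the sample dictionary is the ICD10CM example output shown above. How the result of `predict` is converted into that dictionary shape is not covered here, and the healthcare pipelines themselves require a licensed installation.

```python
def to_pipeline_input(codes):
    """Hypothetical helper: join codes into one whitespace-delimited string."""
    return " ".join(codes)


def pair_codes(result, source_key, target_key):
    """Hypothetical helper: pair each source code with its mapped target code."""
    source, target = result[source_key], result[target_key]
    if len(source) != len(target):
        raise ValueError("source and target lists differ in length")
    return list(zip(source, target))


# Example output copied from the ICD10CM to UMLS pipeline description.
sample_result = {
    "icd10cm": ["M89.50", "R82.2", "R09.01"],
    "umls": ["C4721411", "C0159076", "C0004044"],
}

query = to_pipeline_input(sample_result["icd10cm"])
print(query)  # M89.50 R82.2 R09.01

for icd10cm_code, umls_code in pair_codes(sample_result, "icd10cm", "umls"):
    print(icd10cm_code, "->", umls_code)
```

With a licensed setup, `query` is the kind of string you would pass to `nlu.load('en.resolve.icd10cm.umls').predict(...)`. The same helpers apply to the other four pipelines by swapping the dictionary keys.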
New Healthcare Pipelines
| NLU Reference | Spark NLP Reference | 
|---|---|
| en.resolve.icd10cm.umls | icd10cm_umls_mapping | 
| en.resolve.mesh.umls | mesh_umls_mapping | 
| en.resolve.rxnorm.umls | rxnorm_umls_mapping | 
| en.resolve.rxnorm.mesh | rxnorm_mesh_mapping | 
| en.resolve.snomed.umls | snomed_umls_mapping | 
| en.explain_doc.carp | explain_clinical_doc_carp | 
| en.explain_doc.era | explain_clinical_doc_era | 
New Open Source Models and Pipelines
1 line Install NLU on Google Colab
!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
1 line Install NLU on Kaggle
!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash
Install via PIP
! pip install nlu pyspark==3.0.3