We are extremely excited to announce the release of NLU 3.2, which marks the 1-year anniversary of this magical library.
This release packs features and improvements into every part of NLU: 89 new NLP models, including Longformer, TokenBert, TokenDistilBert, and Multi-Lingual NER for 40+ languages.
It also adds 12 new Healthcare models with trainable sentence resolvers, plus models for Adverse Drug Relations, Clinical Token Bert, NER for Radiology, Drugs, Posology, Administration Cycles, and RXNorm, and new Medical Assertion models.
New Streamlit visualizations let you view entities detected by Named Entity Recognizer models, via their embeddings, on 3-D, 2-D, and 1-D manifolds.
Finally, leveraging PyArrow reduced memory consumption in NLU's core by about 7%, which benefits every computation.
We are incredibly thankful to our community, which helped us come this far, and are looking forward to another magical year of NLU!
Streamlit Entity Manifold visualization
function pipe.viz_streamlit_entity_embed_manifold
Visualize entities recognized by NER models via their entity embeddings in 1-D, 2-D, or 3-D by reducing dimensionality with 10+ supported methods from manifold algorithms and matrix decomposition algorithms.
You can pick additional NER models and compare them via the GUI dropdown on the left.
- Reduces the dimensionality of high-dimensional entity embeddings to 1-D, 2-D, or 3-D and plots the resulting data in an interactive Plotly plot
- Applicable with any of the 330+ Named Entity Recognizer models
- Generates `NUM-DIMENSIONS` * `NUM-NER-MODELS` * `NUM-DIMENSION-REDUCTION-ALGOS` plots
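As a rough illustration of the plot count above, the sketch below enumerates every combination of target dimension, NER model, and reduction algorithm. The model names are placeholders rather than real NLU references, and `planned_plots` is a hypothetical helper, not part of NLU.

```python
from itertools import product


def planned_plots(target_dimensions, ner_models, algos):
    """Return one (dimension, model, algorithm) tuple per plot that would be drawn."""
    return list(product(target_dimensions, ner_models, algos))


target_dimensions = [1, 2, 3]               # 1-D, 2-D, and 3-D
ner_models = ["ner_model_a", "ner_model_b"]  # placeholder names, not real NLU references
algos = ["TSNE", "PCA"]                      # the default algorithms from the parameter table

plots = planned_plots(target_dimensions, ner_models, algos)
print(len(plots))  # 3 dimensions * 2 models * 2 algorithms = 12
```

Every extra model or algorithm multiplies the number of plots, so narrowing the selection keeps the app responsive.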
nlu.load('ner').viz_streamlit_entity_embed_manifold(['Hello From John Snow Labs', 'Peter loves to visit New York'])
or just run
streamlit run
function parameters pipe.viz_streamlit_entity_embed_manifold
Argument | Type | Default | Description |
---|---|---|---|
default_texts | List[str] | ("Donald Trump likes to visit New York", "Angela Merkel likes to visit Berlin!", 'Peter hates visiting Paris') | List of strings to apply classifiers, embeddings, and manifolds to. |
title | str | 'NLU ❤️ Streamlit - Prototype your NLP startup in 0 lines of code🚀' | Title of the Streamlit app |
sub_title | Optional[str] | "Apply any of the 10+ Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Entity Embeddings to 1-D, 2-D and 3-D" | Subtitle of the Streamlit app |
default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of manifold and matrix decomposition algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA', or 'KernelPCA'. |
target_dimensions | List[int] | (1,2,3) | Defines the target dimensions the embeddings will be reduced to |
show_algo_select | bool | True | Show selector for manifold and matrix decomposition algorithms |
set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not. |
num_cols | int | 2 | How many columns the Streamlit layout should use when rendering the similarity matrices. |
key | str | "NLU_streamlit" | Key for the Streamlit elements drawn |
show_logo | bool | True | Show logo |
display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure. |
n_jobs | Optional[int] | 3 | False |
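The parameters in the table can be assembled and sanity-checked before they are handed to the visualization. The following is a minimal sketch: `build_manifold_kwargs` is a hypothetical helper (not part of NLU) that only uses argument names and allowed values listed in the table, and the actual NLU call is left as a comment because it needs a Streamlit session.

```python
# Algorithm names as listed in the parameter table above.
SUPPORTED_ALGOS = {
    "TSNE", "ISOMAP", "LLE", "Spectral Embedding", "MDS", "PCA",
    "SVD aka LSA", "DictionaryLearning", "FactorAnalysis", "FastICA", "KernelPCA",
}


def build_manifold_kwargs(default_texts, algos=("TSNE", "PCA"),
                          target_dimensions=(1, 2, 3), num_cols=2):
    """Hypothetical helper: validate values and return keyword arguments."""
    unknown = [a for a in algos if a not in SUPPORTED_ALGOS]
    if unknown:
        raise ValueError(f"Unsupported algorithms: {unknown}")
    bad_dims = [d for d in target_dimensions if d not in (1, 2, 3)]
    if bad_dims:
        raise ValueError(f"Target dimensions must be 1, 2, or 3, got: {bad_dims}")
    return {
        "default_texts": list(default_texts),
        "default_algos_to_apply": list(algos),
        "target_dimensions": list(target_dimensions),
        "num_cols": num_cols,
    }


kwargs = build_manifold_kwargs(
    ["Hello From John Snow Labs", "Peter loves to visit New York"],
    algos=["PCA"],
    target_dimensions=[2, 3],
)
print(kwargs)

# Inside a Streamlit script with nlu installed, the arguments would be passed on as:
# import nlu
# nlu.load('ner').viz_streamlit_entity_embed_manifold(**kwargs)
```

Validating up front surfaces a misspelled algorithm name before any model is downloaded.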
Sentence Entity Resolver Training
Sentence Entity Resolver Training Tutorial Notebook
Named entities are sub-pieces of textual data that are labeled with classes.
These classes and strings are still ambiguous, though, and it is not possible to group semantically identical entities without a defined terminology.
With the Sentence Resolver, you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.
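To make the idea of mapping different surface forms to one terminology code concrete, here is a toy sketch. It is not how the NLU Sentence Resolver works: it swaps the trained deep learning model for a simple character-trigram cosine similarity, and the second terminology entry (code "42") is invented purely for illustration.

```python
from collections import Counter
from math import sqrt


def char_trigrams(text):
    """Count character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy terminology: code -> known aliases. Code "42" is a made-up second entry.
TERMINOLOGY = {
    "23": ["The Tesla company", "TSLA", "TESLA INC."],
    "42": ["Apple Inc.", "AAPL"],
}


def resolve(mention):
    """Return (code, similarity) of the alias closest to the mention."""
    mention_vec = char_trigrams(mention)
    score, code = max(
        (cosine(mention_vec, char_trigrams(alias)), code)
        for code, aliases in TERMINOLOGY.items()
        for alias in aliases
    )
    return code, score


print(resolve("tesla inc"))
print(resolve("AAPL"))
```

A trained resolver replaces the trigram vectors with learned sentence embeddings, which lets it match mentions that share meaning but little spelling.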
Train a sentence resolver on a dataset with columns named y, _y, and text. y is a label, _y is an extra identifier label, and text is the raw text:
    import pandas as pd
    import nlu

    # Training data: raw text, the label (y), and an extra identifier label (_y)
    dataset = pd.DataFrame({
        'text': ['The Tesla company is good to invest is', 'TSLA is good to invest', 'TESLA INC. we should buy', 'PUT ALL MONEY IN TSLA inc!!'],
        'y': ['23', '23', '23', '23'],
        '_y': ['TESLA', 'TESLA', 'TESLA', 'TESLA'],
    })

    # Load the trainable resolver pipeline and fit it on the dataset
    trainable_pipe = nlu.load('train.resolve_sentence')
    fitted_pipe = trainable_pipe.fit(dataset)

    # Predict on the training data, then on new sentences
    res = fitted_pipe.predict(dataset)
    fitted_pipe.predict(["Peter told me to buy Tesla ", 'I have money to loose, is TSLA a good option?'])
 | sentence_resolution_resolve_sentence_confidence | sentence_resolution_resolve_sentence_code | sentence_resolution_resolve_sentence | sentence |
---|---|---|---|---|
0 | '1.0000' | '23' | 'TESLA' | 'The Tesla company is good to invest is' |
1 | '1.0000' | '23' | 'TESLA' | 'TSLA is good to invest' |
2 | '1.0000' | '23' | 'TESLA' | 'TESLA INC. we should buy' |
3 | '1.0000' | '23' | 'TESLA' | 'PUT ALL MONEY IN TSLA inc!!' |
Alternatively, you can also use non-default healthcare embeddings.
trainable_pipe = nlu.load('en.embed.glove.biovec train.resolve_sentence')
Transformer Models
New models from the spectacular Spark NLP 3.2.0+ releases are integrated: 89 new models in total, with new LongFormer, TokenBert, TokenDistilBert, and Multi-Lingual NER for 40+ languages.
The supported languages with their ISO 639-1 codes are: af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh.
New Healthcare Transformer Models
12 new models from the amazing Spark NLP for Healthcare 3.2.0+ releases, including models for genetic variants, radiology, assertion, rxnorm, adverse drugs, and new clinical tokenbert models that improve accuracy by 4% compared to the previous models.
nlu.load() Reference | Spark NLP Reference | Annotator Class |
---|---|---|
en.med_ner.radiology.wip_greedy_biobert | jsl_rd_ner_wip_greedy_biobert | MedicalNerModel |
en.med_ner.genetic_variants | ner_genetic_variants | MedicalNerModel |
en.med_ner.jsl_slim | ner_jsl_slim | MedicalNerModel |
en.med_ner.jsl_greedy_biobert | ner_jsl_greedy_biobert | MedicalNerModel |
en.embed.token_bert.ner_clinical | bert_token_classifier_ner_clinical | MedicalNerModel |
en.embed.token_bert.ner_jsl | bert_token_classifier_ner_jsl | MedicalNerModel |
en.relation.ade | redl_ade_biobert | RelationExtractionDLModel |
en.relation.ade_clinical | re_ade_clinical | RelationExtractionDLModel |
en.relation.ade_biobert | re_ade_biobert | RelationExtractionDLModel |
en.resolve.rxnorm_disposition | sbiobertresolve_rxnorm_disposition | SentenceEntityResolverModel |
en.assert.jsl | assertion_jsl | AssertionDLModel |
en.assert.jsl_large | assertion_jsl_large | AssertionDLModel |
PyArrow Memory Optimizations
Optimized integration with PyArrow to share memory between the Python Virtual Machine and the Java Virtual Machine, which yields around 7% less memory consumption on average in all computations. This improvement takes effect for everyone using the default PySpark installation, which comes with a compatible PyArrow version.
If you manually install or upgrade PyArrow, please refer to the official Spark docs and make sure the installed PyArrow version works with your PySpark version.
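Before comparing against the Spark docs, it helps to know which versions are actually installed. The sketch below uses only the Python standard library (3.8+); `installed_version` is a hypothetical helper name, and the snippet reports versions without deciding whether they are compatible.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("pyspark", "pyarrow"):
    print(name, installed_version(name) or "not installed")
```

The printed versions can then be checked against the PyArrow requirement that the Spark docs list for your PySpark release.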
New Notebooks
- Sentence Resolution Training Notebook
- Benchmark Notebook
Additional NLU Resources
- 140+ NLU Tutorials
- Streamlit visualizations docs
- The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
- Spark NLP publications
- NLU in Action
- NLU documentation
- Discussions: engage with other community members, share ideas, and show off how you use Spark NLP and NLU!