100+ Transformer Models in 40+ languages, Streamlit Entity Manifold visualizations, Trainable Sentence Resolvers, Memory Optimization, and much more in NLU

14.09.2021

Christian Kasim Loan

Senior Data Scientist at John Snow Labs

We are extremely excited to announce the release of NLU 3.2, which marks the 1-year anniversary of the birth of this magical library.

This release brings features and improvements to every part of NLU: 89 new NLP models, including Longformer, TokenBert, TokenDistilBert, and Multi-Lingual NER for 40+ languages.

It also adds 12 new Healthcare models, with trainable sentence resolvers and models for Adverse Drug Relations, Clinical Token Bert, NER for Radiology, Drugs, Posology, Administration Cycles, and RXNorm, plus new Medical Assertion models.

New Streamlit visualizations let you view the Entities detected by Named-Entity-Recognizer models, via their Embeddings, in 1-D, 2-D, and 3-D Manifolds.

Finally, leveraging PyArrow reduced memory consumption in NLU’s core by about 7%, which benefits every computation.

We are incredibly thankful to our community, which helped us come this far, and are looking forward to another magical year of NLU!

Streamlit Entity Manifold visualization

function `pipe.viz_streamlit_entity_embed_manifold`

Visualize the entities recognized by NER models via their Entity Embeddings in 1-D, 2-D, or 3-D by reducing dimensionality with any of the 10+ supported Manifold and Matrix Decomposition algorithms.
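To make the idea of dimensionality reduction concrete, here is a minimal pure-Python sketch of one Matrix Decomposition approach, PCA, projecting vectors onto their first principal component (the 1-D case). It is an illustration only: the toy vectors and the helper names `center`, `first_component`, and `project_1d` are assumptions, and it does not show how NLU computes its plots.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean so the data is centered at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def first_component(centered, iterations=100, seed=0):
    """Estimate the top principal direction by power iteration on X^T X."""
    d = len(centered[0])
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        # scores = X v, then w = X^T (X v)
        scores = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(s * x[j] for s, x in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # all points identical, no direction to find
        v = [c / norm for c in w]
    return v


def project_1d(rows):
    """Reduce each vector to a single coordinate along the top direction."""
    centered = center(rows)
    v = first_component(centered)
    return [sum(x[j] * v[j] for j in range(len(v))) for x in centered]


# Toy 3-D "embeddings" (hypothetical values) that lie on a straight line
toy_embeddings = [
    [0.0, 0.0, 1.0],
    [1.0, 2.0, 1.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 1.0],
]
coords = project_1d(toy_embeddings)
print(coords)
```

Because the toy points lie on a line, the single coordinate preserves their spacing exactly; real entity embeddings lose some information when reduced, which is why comparing several algorithms side by side is useful.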

You can pick additional NER models and compare them via the GUI dropdown on the left.

Reduces the dimensionality of high-dimensional Entity Embeddings to 1-D, 2-D, or 3-D and plots the resulting data in an interactive Plotly plot
Applicable with any of the 330+ Named Entity Recognizer models
Generates NUM-DIMENSIONS * NUM-NER-MODELS * NUM-DIMENSION-REDUCTION-ALGOS plots
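The plot count above is a simple product of the three selections. As a rough sketch, the combinations can be enumerated like this; `planned_plots` is a hypothetical helper written for illustration and is not part of NLU.

```python
from itertools import product


def planned_plots(target_dimensions, ner_models, algos):
    """Return one (dimension, model, algorithm) triple per plot."""
    return list(product(target_dimensions, ner_models, algos))


plots = planned_plots((1, 2, 3), ["ner", "en.ner.fewnerd"], ["TSNE", "PCA"])
print(len(plots))  # 3 dimensions * 2 models * 2 algorithms = 12
```

Adding one more model or algorithm multiplies the number of plots, so it can be worth keeping the selection small while exploring.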

nlu.load('ner').viz_streamlit_entity_embed_manifold(['Hello From John Snow Labs', 'Peter loves to visit New York'])

or just run

streamlit run https://raw.githubusercontent.com/JohnSnowLabs/nlu/master/examples/streamlit/09_entity_embedding_manifolds.py

function parameters `pipe.viz_streamlit_entity_embed_manifold`

Argument	Type	Default	Description
`default_texts`	`List[str]`	`("Donald Trump likes to visit New York", "Angela Merkel likes to visit Berlin!", "Peter hates visiting Paris")`	List of strings to apply classifiers, embeddings, and manifolds to.
`title`	`str`	`'NLU ❤️ Streamlit - Prototype your NLP startup in 0 lines of code🚀'`	Title of the Streamlit app
`sub_title`	`Optional[str]`	"Apply any of the 10+ `Manifold` or `Matrix Decomposition` algorithms to reduce the dimensionality of `Entity Embeddings` to `1-D`, `2-D` and `3-D`"	Sub title of the Streamlit app
`default_algos_to_apply`	`List[str]`	`["TSNE", "PCA"]`	A list of Manifold and Matrix Decomposition Algorithms to apply. Can be any of `'TSNE'`, `'ISOMAP'`, `'LLE'`, `'Spectral Embedding'`, `'MDS'`, `'PCA'`, `'SVD aka LSA'`, `'DictionaryLearning'`, `'FactorAnalysis'`, `'FastICA'`, or `'KernelPCA'`.
`target_dimensions`	`List[int]`	`(1,2,3)`	Defines the target dimension embeddings will be reduced to
`show_algo_select`	`bool`	`True`	Show selector for Manifold and Matrix Decomposition Algorithms
`set_wide_layout_CSS`	`bool`	`True`	Whether to inject custom CSS or not.
`num_cols`	`int`	`2`	How many columns to use for the Streamlit layout when rendering the similarity matrices.
`key`	`str`	`"NLU_streamlit"`	Key for the Streamlit elements drawn
`show_logo`	`bool`	`True`	Show logo
`display_infos`	`bool`	`False`	Display additional information about ISO codes and the NLU namespace structure.
`n_jobs`	`Optional[int]`	`3`	Number of parallel jobs (cores) to use when running the dimension reduction algorithms.
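Since `default_algos_to_apply` only accepts the names listed in the table, it can help to check a selection before starting the app. The following is a small sketch under stated assumptions: `check_manifold_args` is a hypothetical helper, not an NLU function, and restricting `target_dimensions` to 1, 2, and 3 is inferred from the description above rather than from the library itself.

```python
# Algorithm names copied from the `default_algos_to_apply` row of the table
SUPPORTED_ALGOS = {
    "TSNE", "ISOMAP", "LLE", "Spectral Embedding", "MDS", "PCA",
    "SVD aka LSA", "DictionaryLearning", "FactorAnalysis", "FastICA",
    "KernelPCA",
}
# Assumption: only 1-D, 2-D, and 3-D targets are meaningful for plotting
SUPPORTED_DIMENSIONS = {1, 2, 3}


def check_manifold_args(default_algos_to_apply=("TSNE", "PCA"),
                        target_dimensions=(1, 2, 3)):
    """Validate the selection and return it as keyword arguments."""
    unknown = [a for a in default_algos_to_apply if a not in SUPPORTED_ALGOS]
    if unknown:
        raise ValueError(f"Unsupported algorithms: {unknown}")
    bad = [d for d in target_dimensions if d not in SUPPORTED_DIMENSIONS]
    if bad:
        raise ValueError(f"Unsupported target dimensions: {bad}")
    return {
        "default_algos_to_apply": list(default_algos_to_apply),
        "target_dimensions": list(target_dimensions),
    }


kwargs = check_manifold_args(["PCA", "KernelPCA"], [2, 3])
print(kwargs)
# Hypothetical usage, assuming NLU and Streamlit are installed:
# nlu.load('ner').viz_streamlit_entity_embed_manifold(**kwargs)
```

Note that the names are case-sensitive in this sketch, so `'pca'` would be rejected while `'PCA'` passes.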

Sentence Entity Resolver Training

Sentence Entity Resolver Training Tutorial Notebook

Named Entities are pieces of text that are labeled with classes.

These classes and strings are still ambiguous, though, and semantically identical entities cannot be grouped without a defined terminology.

With the Sentence Resolver you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.

Train a Sentence Resolver on a dataset with columns named `y`, `_y`, and `text`, where `y` is a label, `_y` is an extra identifier label, and `text` is the raw text:

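    import pandas as pd
    import nlu

    # Training data: several surface forms that all refer to the same entity
    dataset = pd.DataFrame({
        'text': ['The Tesla company is good to invest is', 'TSLA is good to invest', 'TESLA INC. we should buy', 'PUT ALL MONEY IN TSLA inc!!'],
        'y': ['23', '23', '23', '23'],  # label of the entity
        '_y': ['TESLA', 'TESLA', 'TESLA', 'TESLA'],  # extra identifier label
    })

    # Load the trainable resolver pipeline and fit it on the dataset
    trainable_pipe = nlu.load('train.resolve_sentence')
    fitted_pipe = trainable_pipe.fit(dataset)

    # Predict on the training data, then on unseen sentences
    res = fitted_pipe.predict(dataset)
    fitted_pipe.predict(["Peter told me to buy Tesla ", 'I have money to loose, is TSLA a good option?'])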

	sentence_resolution_resolve_sentence_confidence	sentence_resolution_resolve_sentence_code	sentence_resolution_resolve_sentence	sentence
0	'1.0000'	'23'	'TESLA'	'The Tesla company is good to invest is'
1	'1.0000'	'23'	'TESLA'	'TSLA is good to invest'
2	'1.0000'	'23'	'TESLA'	'TESLA INC. we should buy'
3	'1.0000'	'23'	'TESLA'	'PUT ALL MONEY IN TSLA inc!!'
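Because training depends on the three column names `text`, `y`, and `_y`, a quick check before building the DataFrame can catch a misnamed or uneven column early. This is a minimal sketch; `check_resolver_columns` is a hypothetical helper and not part of NLU.

```python
REQUIRED_COLUMNS = ("text", "y", "_y")


def check_resolver_columns(columns):
    """Check a dict of column name -> list of values; return the row count."""
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    lengths = {name: len(columns[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Columns have different lengths: {lengths}")
    return lengths["text"]


columns = {
    "text": ["The Tesla company is good to invest is", "TSLA is good to invest",
             "TESLA INC. we should buy", "PUT ALL MONEY IN TSLA inc!!"],
    "y": ["23", "23", "23", "23"],
    "_y": ["TESLA", "TESLA", "TESLA", "TESLA"],
}
n_rows = check_resolver_columns(columns)
print(n_rows)  # 4 training rows
```

The validated dictionary can then be passed to `pd.DataFrame(columns)` as in the training example.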

Alternatively, you can also use non-default healthcare embeddings.

trainable_pipe = nlu.load('en.embed.glove.biovec train.resolve_sentence')

Transformer Models

New models from the spectacular Spark NLP 3.2.0+ releases are integrated: 89 new models in total, with new LongFormer, TokenBert, TokenDistilBert, and Multi-Lingual NER for 40+ languages.

The supported languages, with their ISO 639-1 codes, are: af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh.
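If you want to check programmatically whether a language is covered by the list above, a simple lookup is enough. This is an illustrative sketch; the set is copied from the list above, and `supports_language` is a hypothetical helper, not an NLU function.

```python
# ISO 639-1 codes copied from the list of supported languages above
SUPPORTED_NER_LANGUAGES = frozenset(
    "af ar bg bn de el en es et eu fa fi fr he hi hu id it ja jv "
    "ka kk ko ml mr ms my nl pt ru sw ta te th tl tr ur vi yo zh".split()
)


def supports_language(code):
    """Return True if the ISO 639-1 code appears in the list above."""
    return code.strip().lower() in SUPPORTED_NER_LANGUAGES


print(len(SUPPORTED_NER_LANGUAGES))  # 40
print(supports_language("DE"), supports_language("sv"))  # True False
```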

nlu.load() Reference	Spark NLP Reference	Annotator Class	Language
en.embed.longformer	longformer_base_4096	LongformerEmbeddings	en
en.embed.longformer.large	longformer_large_4096	LongformerEmbeddings	en
en.ner.ontonotes_roberta_base	ner_ontonotes_roberta_base	NerDLModel	en
en.ner.ontonotes_roberta_large	ner_ontonotes_roberta_large	NerDLModel	en
en.ner.ontonotes_distilbert_base_cased	ner_ontonotes_distilbert_base_cased	NerDLModel	en
en.ner.conll_bert_base_cased	ner_conll_bert_base_cased	NerDLModel	en
en.ner.conll_distilbert_base_cased	ner_conll_distilbert_base_cased	NerDLModel	en
en.ner.conll_roberta_base	ner_conll_roberta_base	NerDLModel	en
en.ner.conll_roberta_large	ner_conll_roberta_large	NerDLModel	en
en.ner.conll_xlm_roberta_base	ner_conll_xlm_roberta_base	NerDLModel	en
en.ner.conll_longformer_large_4096	ner_conll_longformer_large_4096	NerDLModel	en
en.embed.token_bert.conll03	bert_base_token_classifier_conll03	NerDLModel	en
en.embed.token_bert.large_conll03	bert_large_token_classifier_conll03	NerDLModel	en
en.embed.token_bert.ontonote	bert_base_token_classifier_ontonote	NerDLModel	en
en.embed.token_bert.large_ontonote	bert_large_token_classifier_ontonote	NerDLModel	en
en.embed.token_bert.few_nerd	bert_base_token_classifier_few_nerd	NerDLModel	en
fa.embed.token_bert.parsbert_armanner	bert_token_classifier_parsbert_armanner	NerDLModel	fa
fa.embed.token_bert.parsbert_ner	bert_token_classifier_parsbert_ner	NerDLModel	fa
fa.embed.token_bert.parsbert_peymaner	bert_token_classifier_parsbert_peymaner	NerDLModel	fa
tr.embed.token_bert.turkish_ner	bert_token_classifier_turkish_ner	NerDLModel	tr
es.embed.token_bert.spanish_ner	bert_token_classifier_spanish_ner	NerDLModel	es
sv.embed.token_bert.swedish_ner	bert_token_classifier_swedish_ner	NerDLModel	sv
en.ner.fewnerd	nerdl_fewnerd_100d	NerDLModel	en
en.ner.fewnerd_subentity	nerdl_fewnerd_subentity_100d	NerDLModel	en
en.ner.movie	ner_mit_movie_complex_bert_base_cased	NerDLModel	en
en.ner.movie_complex	ner_mit_movie_complex_bert_base_cased	NerDLModel	en
en.ner.movie_simple	ner_mit_movie_complex_bert_base_cased	NerDLModel	en
en.ner.mit_movie_complex_bert	ner_mit_movie_complex_bert_base_cased	NerDLModel	en
en.ner.mit_movie_complex_distilbert	ner_mit_movie_complex_distilbert_base_cased	NerDLModel	en
en.ner.mit_movie_simple	ner_mit_movie_simple_distilbert_base_cased	NerDLModel	en
en.embed_sentence.bert_use_cmlm_en_base	sent_bert_use_cmlm_en_base	BertSentenceEmbeddings	en
en.embed_sentence.bert_use_cmlm_en_large	sent_bert_use_cmlm_en_large	BertSentenceEmbeddings	en
xx.ner.xtreme_glove_840B_300	ner_xtreme_glove_840B_300	NerDLModel	xx
xx.ner.xtreme_xlm_roberta_xtreme_base	ner_xtreme_xlm_roberta_xtreme_base	NerDLModel	xx
xx.ner.wikiner_glove_840B_300	ner_wikiner_glove_840B_300	NerDLModel	xx
xx.ner.wikiner_xlm_roberta_base	ner_wikiner_xlm_roberta_base	NerDLModel	xx
xx.embed_sentence.bert_use_cmlm_multi_base_br	sent_bert_use_cmlm_multi_base_br	BertSentenceEmbeddings	xx
xx.embed_sentence.bert_use_cmlm_multi_base	sent_bert_use_cmlm_multi_base	BertSentenceEmbeddings	xx
xx.embed.xlm_roberta_xtreme_base	xlm_roberta_xtreme_base	XlmRoBertaEmbeddings	xx
xx.embed.bert_base_multilingual_cased	bert_base_multilingual_cased	Embeddings	xx
xx.embed.bert_base_multilingual_uncased	bert_base_multilingual_uncased	Embeddings	xx
xx.af.translate_to.ru	opus_tatoeba_af_ru	Translation	xx
xx.he.translate_to.fr	opus_tatoeba_he_fr	Translation	xx
xx.it.translate_to.he	opus_tatoeba_it_he	Translation	xx
xx.cs.translate_to.sv	opus_mt_cs_sv	Translation	xx
tr.classify.cyberbullying	classifierdl_berturk_cyberbullying	Pipelines	tr
zh.embed.xlnet	chinese_xlnet_base	Embeddings	zh
de.classify.news	classifierdl_bert_news	Pipelines	de
tr.classify.berturk_cyberbullying	classifierdl_berturk_cyberbullying_pipeline	Pipelines	tr
de.classify.bert_news	classifierdl_bert_news_pipeline	Pipelines	de
en.classify.electra_questionpair	classifierdl_electra_questionpair_pipeline	Pipelines	en
tr.classify.bert_news	classifierdl_bert_news_pipeline	Pipelines	tr
en.ner.conll_elmo	ner_conll_elmo	NerDLModel	en
en.ner.conll_albert_base_uncased	ner_conll_albert_base_uncased	NerDLModel	en
en.ner.conll_albert_large_uncased	ner_conll_albert_large_uncased	NerDLModel	en
en.ner.conll_xlnet_base_cased	ner_conll_xlnet_base_cased	NerDLModel	en
xx.embed.bert.muril	bert_muril	BertEmbeddings	xx
en.embed.bert.wiki_books_sst2	bert_wiki_books_sst2	BertEmbeddings	en
en.embed.bert.wiki_books_squad2	bert_wiki_books_squad2	BertEmbeddings	en
en.embed.bert.wiki_books_qqp	bert_wiki_books_qqp	BertEmbeddings	en
en.embed.bert.wiki_books_qnli	bert_wiki_books_qnli	BertEmbeddings	en
en.embed.bert.wiki_books_mnli	bert_wiki_books_mnli	BertEmbeddings	en
en.embed.bert.wiki_books	bert_wiki_books	BertEmbeddings	en
en.embed.bert.pubmed_squad2	bert_pubmed_squad2	BertEmbeddings	en
en.embed.bert.pubmed	bert_pubmed	BertEmbeddings	en
en.embed_sentence.bert.wiki_books_sst2	sent_bert_wiki_books_sst2	BertSentenceEmbeddings	en
en.embed_sentence.bert.wiki_books_squad2	sent_bert_wiki_books_squad2	BertSentenceEmbeddings	en
en.embed_sentence.bert.wiki_books_qqp	sent_bert_wiki_books_qqp	BertSentenceEmbeddings	en
en.embed_sentence.bert.wiki_books_qnli	sent_bert_wiki_books_qnli	BertSentenceEmbeddings	en
en.embed_sentence.bert.wiki_books_mnli	sent_bert_wiki_books_mnli	BertSentenceEmbeddings	en
en.embed_sentence.bert.wiki_books	sent_bert_wiki_books	BertSentenceEmbeddings	en
en.embed_sentence.bert.pubmed_squad2	sent_bert_pubmed_squad2	BertSentenceEmbeddings	en
en.embed_sentence.bert.pubmed	sent_bert_pubmed	BertSentenceEmbeddings	en
xx.embed_sentence.bert.muril	sent_bert_muril	BertSentenceEmbeddings	xx
yi.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	yi
uk.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	uk
te.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	te
ta.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	ta
so.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	so
sd.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	sd
ru.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	ru
pa.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	pa
ne.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	ne
mr.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	mr
ml.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	ml
kn.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	kn
bs.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	bs
id.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	id
gu.detect_sentence	sentence_detector_dl	SentenceDetectorDLModel	gu
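In every row of the table, the first dot-separated component of the `nlu.load()` reference matches the language column. The sketch below uses that pattern to group a few rows by language; the `ROWS` excerpt is copied from the table, while `language_of` is a hypothetical helper for illustration only.

```python
from collections import defaultdict

# A small excerpt of the table: nlu.load() reference -> Spark NLP reference
ROWS = {
    "en.embed.longformer": "longformer_base_4096",
    "en.ner.fewnerd": "nerdl_fewnerd_100d",
    "fa.embed.token_bert.parsbert_ner": "bert_token_classifier_parsbert_ner",
    "xx.ner.wikiner_glove_840B_300": "ner_wikiner_glove_840B_300",
    "tr.classify.cyberbullying": "classifierdl_berturk_cyberbullying",
}


def language_of(reference):
    """Return the leading language component of an nlu.load() reference."""
    return reference.split(".", 1)[0]


by_language = defaultdict(list)
for reference in sorted(ROWS):
    by_language[language_of(reference)].append(reference)

for language, references in sorted(by_language.items()):
    print(language, references)
```

Here `xx` marks the multi-lingual models, so filtering on that prefix is a quick way to find models that are not tied to a single language.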

New Healthcare Transformer Models

12 new models from the amazing Spark NLP for Healthcare 3.2.0+ releases, including models for genetic variants, radiology, assertion, RxNorm, and adverse drug events, plus new clinical TokenBert models that improve accuracy by 4% compared to the previous models.

nlu.load() Reference	Spark NLP Reference	Annotator Class
en.med_ner.radiology.wip_greedy_biobert	jsl_rd_ner_wip_greedy_biobert	MedicalNerModel
en.med_ner.genetic_variants	ner_genetic_variants	MedicalNerModel
en.med_ner.jsl_slim	ner_jsl_slim	MedicalNerModel
en.med_ner.jsl_greedy_biobert	ner_jsl_greedy_biobert	MedicalNerModel
en.embed.token_bert.ner_clinical	bert_token_classifier_ner_clinical	MedicalNerModel
en.embed.token_bert.ner_jsl	bert_token_classifier_ner_jsl	MedicalNerModel
en.relation.ade	redl_ade_biobert	RelationExtractionDLModel
en.relation.ade_clinical	re_ade_clinical	RelationExtractionDLModel
en.relation.ade_biobert	re_ade_biobert	RelationExtractionDLModel
en.resolve.rxnorm_disposition	sbiobertresolve_rxnorm_disposition	SentenceEntityResolverModel
en.assert.jsl	assertion_jsl	AssertionDLModel
en.assert.jsl_large	assertion_jsl_large	AssertionDLModel

PyArrow Memory Optimizations

The integration with PyArrow has been optimized to share memory between the Python Virtual Machine and the Java Virtual Machine, which yields around 7% less memory consumption on average in all computations. This improvement takes effect for everyone using the default PySpark installation, which comes with a compatible PyArrow version.

If you manually install or upgrade PyArrow, please refer to the official Spark docs and make sure the PyArrow version you install works with your PySpark version.
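A quick way to see which versions are installed in the current environment, before comparing them against the Spark docs, is to query the package metadata. This sketch uses only the Python standard library (Python 3.8+); `installed_version` and `version_report` are hypothetical helper names, and the code reports versions without judging whether they are compatible.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_report(packages=("pyspark", "pyarrow")):
    """Collect the installed versions of the given distributions."""
    return {name: installed_version(name) for name in packages}


for name, version in version_report().items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```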

New Notebooks

Sentence Resolution Training Notebook
Benchmark Notebook

Additional NLU Resources

140+ NLU Tutorials
Streamlit visualizations docs
The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
Spark NLP publications
NLU in Action
NLU documentation
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP and NLU!

Christian Kasim Loan is a computer scientist with over 10 years of coding experience. He works for John Snow Labs as a Senior Data Scientist, where he helps port the latest and greatest Machine Learning models to Spark, and he created the NLU library.
