Question Answering, Intent Classification, Aspect Based NER, and New Multilingual Models in Python’s NLU Library

12.02.2021

Christian Kasim Loan

Senior Data Scientist at John Snow Labs

We are very excited to release NLU 1.1.1!

This release features 3 new tutorial notebooks for Open/Closed book question answering with Google’s T5, Intent classification, and Aspect Based NER.

In addition, NLU 1.1.0 comes with 25+ pre-trained models and pipelines in Amharic, Bengali, Bhojpuri, Japanese, and Korean from the Spark NLP 2.7.2 release. Finally, NLU now supports running on Spark 2.3 clusters.

NLU 1.1.0 New Non-English Models

Language	nlu.load() reference	Spark NLP Model reference	Type
Arabic	ar.ner	arabic_w2v_cc_300d	Named Entity Recognizer
Arabic	ar.embed.aner	aner_cc_300d	Word Embedding
Arabic	ar.embed.aner.300d	aner_cc_300d	Word Embedding (Alias)
Bengali	bn.stopwords	stopwords_bn	Stopwords Cleaner
Bengali	bn.pos	pos_msri	Part of Speech
Thai	th.segment_words	wordseg_best	Word Segmenter
Thai	th.pos	pos_lst20	Part of Speech
Thai	th.sentiment	sentiment_jager_use	Sentiment Classifier
Thai	th.classify.sentiment	sentiment_jager_use	Sentiment Classifier (Alias)
Chinese	zh.pos.ud_gsd_trad	pos_ud_gsd_trad	Part of Speech
Chinese	zh.segment_words.gsd	wordseg_gsd_ud_trad	Word Segmenter
Bihari	bh.pos	pos_ud_bhtb	Part of Speech
Amharic	am.pos	pos_ud_att	Part of Speech

NLU 1.1.1 New English Models and Pipelines

Language	nlu.load() reference	Spark NLP Model reference	Type
English	en.sentiment.glove	analyze_sentimentdl_glove_imdb	Sentiment Classifier
English	en.sentiment.glove.imdb	analyze_sentimentdl_glove_imdb	Sentiment Classifier (Alias)
English	en.classify.sentiment.glove.imdb	analyze_sentimentdl_glove_imdb	Sentiment Classifier (Alias)
English	en.classify.sentiment.glove	analyze_sentimentdl_glove_imdb	Sentiment Classifier (Alias)
English	en.classify.trec50.pipe	classifierdl_use_trec50_pipeline	Language Classifier
English	en.ner.onto.large	onto_recognize_entities_electra_large	Named Entity Recognizer
English	en.classify.questions.atis	classifierdl_use_atis	Intent Classifier
English	en.classify.questions.airline	classifierdl_use_atis	Intent Classifier (Alias)
English	en.classify.intent.atis	classifierdl_use_atis	Intent Classifier (Alias)
English	en.classify.intent.airline	classifierdl_use_atis	Intent Classifier (Alias)
English	en.ner.atis	nerdl_atis_840b_300d	Aspect based NER
English	en.ner.airline	nerdl_atis_840b_300d	Aspect based NER (Alias)
English	en.ner.aspect.airline	nerdl_atis_840b_300d	Aspect based NER (Alias)
English	en.ner.aspect.atis	nerdl_atis_840b_300d	Aspect based NER (Alias)
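
Several rows above share one Spark NLP model reference, which is what the "(Alias)" marker indicates. The following is a minimal sketch in plain Python, with no NLU install needed, that groups the references from the table by the model they resolve to. The `ROWS` list is copied by hand from the table, and `group_aliases` is a hypothetical helper introduced here, not part of NLU.

```python
from collections import defaultdict

# (nlu.load() reference, Spark NLP model reference), copied from the table above.
ROWS = [
    ("en.sentiment.glove", "analyze_sentimentdl_glove_imdb"),
    ("en.sentiment.glove.imdb", "analyze_sentimentdl_glove_imdb"),
    ("en.classify.sentiment.glove.imdb", "analyze_sentimentdl_glove_imdb"),
    ("en.classify.sentiment.glove", "analyze_sentimentdl_glove_imdb"),
    ("en.classify.trec50.pipe", "classifierdl_use_trec50_pipeline"),
    ("en.ner.onto.large", "onto_recognize_entities_electra_large"),
    ("en.classify.questions.atis", "classifierdl_use_atis"),
    ("en.classify.questions.airline", "classifierdl_use_atis"),
    ("en.classify.intent.atis", "classifierdl_use_atis"),
    ("en.classify.intent.airline", "classifierdl_use_atis"),
    ("en.ner.atis", "nerdl_atis_840b_300d"),
    ("en.ner.airline", "nerdl_atis_840b_300d"),
    ("en.ner.aspect.airline", "nerdl_atis_840b_300d"),
    ("en.ner.aspect.atis", "nerdl_atis_840b_300d"),
]


def group_aliases(rows):
    """Map each Spark NLP model name to the nlu.load() references that point at it."""
    grouped = defaultdict(list)
    for reference, model in rows:
        grouped[model].append(reference)
    return dict(grouped)


ALIASES = group_aliases(ROWS)
for model, references in ALIASES.items():
    print(f"{model}: {', '.join(references)}")
```

Any reference in a group should load the same underlying model, so the choice between them is a matter of readability.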

New Easy NLU 1-liner Examples:

Extract aspects and entities from airline questions (ATIS dataset)

      nlu.load("en.ner.atis").predict("i want to fly from baltimore to dallas round trip")
      output:  ["baltimore", "dallas", "round trip"]

Intent Classification for Airline Traffic Information System queries (ATIS dataset)

      nlu.load("en.classify.questions.atis").predict("what is the price of flight from newyork to washington")
      output:  "atis_airfare"
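
The same reference can label several questions in one call, on the assumption that `predict` accepts a list of strings as well as a single string. The sketch below wraps that in a small function. `classify_intents` and its `load` parameter are hypothetical names introduced here so the function can be exercised without Spark. When `load` is omitted, it falls back to the real `nlu.load`.

```python
def classify_intents(questions, load=None, reference="en.classify.questions.atis"):
    """Run the ATIS intent classifier over a collection of questions.

    `load` is a hypothetical injection point for illustration only; when it
    is omitted, `nlu.load` is imported and used.
    """
    if load is None:
        import nlu  # imported lazily, so NLU is only required for real runs
        load = nlu.load
    pipeline = load(reference)
    # Pass all questions at once instead of calling predict() per question.
    return pipeline.predict(list(questions))
```

With NLU installed, `classify_intents(["what is the price of flight from newyork to washington"])` would run the one-liner above on a batch of one. The exact shape of the returned object depends on the NLU version.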

Recognize Entities OntoNotes – ELECTRA Large

      nlu.load("en.ner.onto.large").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London.")	
      output:  ["Johnson", "first", "2001", "eight years", "London"]

Question classification of open-domain and fact-based questions Pipeline – TREC50

      nlu.load("en.classify.trec50.pipe").predict("When did the construction of stone circles begin in the UK? ")
      output:  "LOC_other"

Traditional Chinese Word Segmentation

      # 'However, this treatment also creates some problems' in Chinese
      nlu.load("zh.segment_words.gsd").predict("然而，這樣的處理也衍生了一些問題。")
      output:  ["然而","，","這樣","的","處理","也","衍生","了","一些","問題","。"]

Part of Speech for Traditional Chinese

      # 'However, this treatment also creates some problems' in Chinese
      nlu.load("zh.pos.ud_gsd_trad").predict("然而，這樣的處理也衍生了一些問題。")

Output:

Token	POS
然而	ADV
，	PUNCT
這樣	PRON
的	PART
處理	NOUN
也	ADV
衍生	VERB
了	PART
一些	ADJ
問題	NOUN
。	PUNCT

Thai Word Segment Recognition

      # 'Mona Lisa is a 16th-century oil painting created by Leonardo held at the Louvre in Paris' in Thai
      nlu.load("th.segment_words").predict("Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส")

Output:

token
M
o
n
a
Lisa
เป็น
ภาพ
ว
า
ด
สีน้ำ
มัน
ใน
ศตวรรษ
ที่
16
ที่
สร้าง
โ
ด
ย
L
e
o
n
a
r
d
o
จัด
ขึ้น
ที่
พิพิธภัณฑ์
ลูฟร์
ใน
ปารีส

Part of Speech for Bengali (POS)

      # 'The village is also called 'Mod' in Tora language' in Bengali 
      nlu.load("bn.pos").predict("বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷")

Output:

token	pos
বাসস্থান-ঘরগৃহস্থালি	NN
তোড়া	NNP
ভাষায়	NN
গ্রামকেও	NN
বলে	VM
`	SYM
মোদ	NN
'	SYM
৷	SYM

Stop Words Cleaner for Bengali

      # 'This language is not enough' in Bengali 
      df = nlu.load("bn.stopwords").predict("এই ভাষা যথেষ্ট নয়")

Output:

cleanTokens	token
ভাষা	এই
যথেষ্ট	ভাষা
নয়	যথেষ্ট
None	নয়
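
The two columns above have different lengths: four tokens go in and three survive the cleaner (এই is dropped), so the shorter column is padded with None and the rows no longer line up word for word. The sketch below reproduces that side-by-side layout in plain Python. The padding behaviour is an assumption read off the table, not taken from NLU's source, and `side_by_side` is a hypothetical helper.

```python
from itertools import zip_longest

tokens = ["এই", "ভাষা", "যথেষ্ট", "নয়"]   # the `token` column above
clean_tokens = tokens[1:]                  # the `cleanTokens` column, without the None padding


def side_by_side(clean, raw):
    """Pair two columns by position, padding the shorter one with None."""
    return list(zip_longest(clean, raw, fillvalue=None))


rows = side_by_side(clean_tokens, tokens)
for clean, raw in rows:
    print(clean, raw, sep="\t")
```

Read this way, each column is a list in its own right, and a value in `cleanTokens` is not a cleaned version of the `token` on the same row.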

Part of Speech for Bihari (Bhojpuri)

      # 'The people of Ohu know that the foundation of Bhojpuri was shaken' in Bhojpuri
      nlu.load('bh.pos').predict("ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई")

Output:

pos	token
DET	ओहु
NOUN	लोग
ADP	के
NOUN	मालूम
VERB	बा
SCONJ	कि
ADJ	श्लील
VERB	होखते
PROPN	भोजपुरी
ADP	के
NOUN	नींव
VERB	हिल
AUX	जाई

Amharic Part of Speech (POS)

      # ' "Son, finish the job," he said.' in Amharic
      nlu.load('am.pos').predict('ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ"')

Output:

pos	token
NOUN	ልጅ
DET	ኡ
PART	ን
NOUN	ሥራ
DET	ው
PART	ን
VERB	አስጨርስ
PRON	ኧው
AUX	ኣል
PRON	ኧሁ
PUNCT	።
NOUN	“

Thai Sentiment Classification

      # 'I love peanut butter and jelly!' in Thai
      nlu.load('th.classify.sentiment').predict('ฉันชอบเนยถั่วและเยลลี่!')[['sentiment','sentiment_confidence']]

Output:

sentiment	sentiment_confidence
positive	0.999998

Arabic Named Entity Recognition (NER)

      # 'In 1918, the forces of the Arab Revolt liberated Damascus with the help of the British' in Arabic
      nlu.load('ar.ner').predict('في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز',output_level='chunk')[['entities_confidence','ner_confidence','entities']]

Output:

entity_class	ner_confidence	entities
ORG	[1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669]	قوات الثورة العربية
LOC	[1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669]	دمشق
PER	[1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669]	الإنكليز
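
Each row above carries the same `ner_confidence` list. It has 11 values, which matches the 11 whitespace-separated words of the input sentence, so it reads as one confidence per token for the whole sentence and not a score per entity. Under that assumption, the sketch below pairs words with scores and picks the least certain one. `least_confident` is a hypothetical helper, not an NLU function.

```python
sentence = "في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز"

# Copied from the ner_confidence column above (identical in all three rows).
ner_confidence = [
    1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063,
    0.9987999796867371, 0.9990000128746033, 0.9998999834060669,
    0.9998999834060669, 0.9993000030517578, 0.9998999834060669,
]


def least_confident(words, scores):
    """Return the (word, score) pair with the lowest score."""
    if len(words) != len(scores):
        raise ValueError("expected exactly one score per word")
    return min(zip(words, scores), key=lambda pair: pair[1])


# Assumption: whitespace splitting matches the model's tokenization here.
words = sentence.split()
word, score = least_confident(words, ner_confidence)
print(word, score)
```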

NLU 1.1.0 Enhancements

Spark 2.3 compatibility

New NLU Notebooks and Tutorials

Intent Classification for Airline messages ATIS

Installation

      # PyPI
      !pip install nlu pyspark==2.4.7
      # Conda: install NLU from the johnsnowlabs channel
      conda install -c johnsnowlabs nlu
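
After installing, it can help to confirm which versions actually ended up in the environment, since the pip line above pins pyspark to 2.4.7. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8). `installed_version` is a hypothetical helper, and the distribution names `nlu` and `pyspark` are taken from the install commands.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


for name in ("nlu", "pyspark"):
    print(name, installed_version(name) or "not installed")
```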

Additional NLU resources

Christian Kasim Loan is a computer scientist with over 10 years of coding experience. He works at John Snow Labs as a Senior Data Scientist, where he helps port the latest machine learning models to Spark, and he created the NLU library.
