
    Spark NLP 3.1: 2600+ new models and pipelines in 200+ languages and new DistilBERT, RoBERTa, & XLM-RoBERTa transformers

Maziyar Panahi
Senior Data Scientist and Spark NLP Lead at John Snow Labs

    We are very excited to release Spark NLP 3.1 today!

This is one of our biggest releases yet, with lots of new models, pipelines, and groundwork for future features.

Spark NLP 3.1 comes with more than 2,600 new pretrained models and pipelines in more than 200 languages, new DistilBERT, RoBERTa, and XLM-RoBERTa annotators, support for HuggingFace 🤗 (Autoencoding) models in Spark NLP, and extended support for new Databricks and EMR instances.

    As always, we would like to thank our community for their feedback, questions, and feature requests.

     

    Major features and improvements

• NEW: Introducing the DistilBertEmbeddings annotator. DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance (see the usage sketch after this list)
    • NEW: Introducing RoBERTaEmbeddings annotator. RoBERTa (Robustly Optimized BERT-Pretraining Approach) models deliver state-of-the-art performance on NLP/NLU tasks and a sizable performance improvement on the GLUE benchmark. With a score of 88.5, RoBERTa reached the top position on the GLUE leaderboard
    • NEW: Introducing XlmRoBERTaEmbeddings annotator. XLM-RoBERTa (Unsupervised Cross-lingual Representation Learning at Scale) is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data with 100 different languages. It also outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model
• NEW: Introducing support for HuggingFace exported models in equivalent Spark NLP annotators. Starting with this release, you can use the saved_model feature in HuggingFace and import any BERT, DistilBERT, RoBERTa, or XLM-RoBERTa model into Spark NLP within a few lines of code (a condensed import sketch appears after the notebooks list below). We will extend this support to the remaining annotators with each release – for more information, please visit this discussion
• NEW: Migrate MarianTransformer to BatchAnnotate so you can control throughput on accelerated hardware such as GPUs and fully utilize it
    • Upgrade to TensorFlow v2.4.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
    • Update to CUDA11 and cuDNN 8.0.2 for GPU support
• Implement ModelSignatureManager to automatically detect inputs and outputs, and to save and restore tensors from SavedModel in TF v2. This allows Spark NLP 3.1.x to extend support for external encoders such as HuggingFace and TF Hub (coming soon!)
• Implement a new BPE tokenizer for RoBERTa and XLM models. This tokenizer uses the custom tokens from Tokenizer or RegexTokenizer and generates token pieces, encodes, and decodes the results
    • Welcoming new Databricks runtimes to our Spark NLP family:
      • Databricks 8.1 ML & GPU
      • Databricks 8.2 ML & GPU
      • Databricks 8.3 ML & GPU
    • Welcoming a new EMR 6.x series to our Spark NLP family:
      • EMR 6.3.0 (Apache Spark 3.1.1 / Hadoop 3.2.1)
    • Added examples to Spark NLP Scaladoc
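
To make the new annotators concrete, here is a minimal PySpark sketch of DistilBertEmbeddings in a pipeline. The column names and sample sentence are ours; the model name distilbert_base_cased is listed under Featured Transformers below.

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertEmbeddings
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# Raw text -> Spark NLP document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Document -> tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Downloads the pretrained model on first use
embeddings = DistilBertEmbeddings.pretrained("distilbert_base_cased", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])

data = spark.createDataFrame(
    [["Spark NLP 3.1 adds DistilBERT, RoBERTa, and XLM-RoBERTa."]]
).toDF("text")

result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show()
```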

     

    Models and Pipelines

Spark NLP 3.1.0 comes with more than 2,600 new pretrained models and pipelines in over 200 languages, available for Windows, Linux, and macOS users.

     

    Featured Transformers

Model                | Name                               | Build | Lang
XlmRoBertaEmbeddings | twitter_xlm_roberta_base           | 3.1.0 | xx
XlmRoBertaEmbeddings | xlm_roberta_base                   | 3.1.0 | xx
RoBertaEmbeddings    | distilroberta_base                 | 3.1.0 | en
RoBertaEmbeddings    | roberta_large                      | 3.1.0 | en
RoBertaEmbeddings    | roberta_base                       | 3.1.0 | en
DistilBertEmbeddings | distilbert_base_multilingual_cased | 3.1.0 | xx
DistilBertEmbeddings | distilbert_base_uncased            | 3.1.0 | en
DistilBertEmbeddings | distilbert_base_cased              | 3.1.0 | en
BertEmbeddings       | bert_base_chinese                  | 3.1.0 | zh
BertEmbeddings       | chinese_bert_wwm                   | 3.1.0 | zh
BertEmbeddings       | bert_base_turkish_uncased          | 3.1.0 | tr
BertEmbeddings       | bert_base_turkish_cased            | 3.1.0 | tr
BertEmbeddings       | bert_base_italian_uncased          | 3.1.0 | it
BertEmbeddings       | bert_base_italian_cased            | 3.1.0 | it
BertEmbeddings       | bert_base_german_uncased           | 3.1.0 | de
BertEmbeddings       | bert_base_german_cased             | 3.1.0 | de
BertEmbeddings       | bert_base_dutch_cased              | 3.1.0 | nl
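
Each row of this table maps directly onto a pretrained(name, lang) call. As a sketch, swapping the new XLM-RoBERTa model into the pipeline shown earlier would look like this (assuming the same document and token columns):

```python
from sparknlp.annotator import XlmRoBertaEmbeddings

# "xlm_roberta_base" and "xx" come straight from the Name and Lang columns above
embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
```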

     

    Featured Translation Models

Model             | Name                  | Build | Lang
MarianTransformer | Chinese to Vietnamese | 3.1.0 | xx
MarianTransformer | Chinese to Ukrainian  | 3.1.0 | xx
MarianTransformer | Chinese to Dutch      | 3.1.0 | xx
MarianTransformer | Chinese to English    | 3.1.0 | xx
MarianTransformer | Chinese to Finnish    | 3.1.0 | xx
MarianTransformer | Chinese to Italian    | 3.1.0 | xx
MarianTransformer | Yoruba to English     | 3.1.0 | xx
MarianTransformer | Yapese to French      | 3.1.0 | xx
MarianTransformer | Waray to Spanish      | 3.1.0 | xx
MarianTransformer | Ukrainian to English  | 3.1.0 | xx
MarianTransformer | Hindi to Urdu         | 3.1.0 | xx
MarianTransformer | Italian to Ukrainian  | 3.1.0 | xx
MarianTransformer | Italian to Icelandic  | 3.1.0 | xx
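
A translation pipeline follows the same pattern, pairing a SentenceDetector with a MarianTransformer. In this sketch, opus_mt_zh_en is our assumed name for the Chinese-to-English entry; please confirm the exact model name on Models Hub:

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, MarianTransformer
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Marian translates sentence by sentence
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Assumed name for the Chinese-to-English model; verify on Models Hub
translator = MarianTransformer.pretrained("opus_mt_zh_en", "xx") \
    .setInputCols(["sentence"]) \
    .setOutputCol("translation")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, translator])
```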

     

    Transformers in Spark NLP

    Import hundreds of models in different languages to Spark NLP

     

Spark NLP              | HuggingFace Notebooks
BertEmbeddings         | HuggingFace in Spark NLP – BERT
BertSentenceEmbeddings | HuggingFace in Spark NLP – BERT Sentence
DistilBertEmbeddings   | HuggingFace in Spark NLP – DistilBERT
RoBertaEmbeddings      | HuggingFace in Spark NLP – RoBERTa
XlmRoBertaEmbeddings   | HuggingFace in Spark NLP – XLM-RoBERTa
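
The notebooks above walk through the full export/import flow. Condensed into a rough sketch (paths and the assets copy step follow the notebooks' pattern, and spark is the session returned by sparknlp.start()), it looks like this:

```python
import shutil
from transformers import TFDistilBertModel, DistilBertTokenizer
from sparknlp.annotator import DistilBertEmbeddings

MODEL_NAME = "distilbert-base-cased"

# 1. Export the HuggingFace model as a TF SavedModel
#    (written under ./<name>/saved_model/1)
model = TFDistilBertModel.from_pretrained(MODEL_NAME)
model.save_pretrained(f"./{MODEL_NAME}", saved_model=True)

# 2. Save the tokenizer and copy its vocabulary into the SavedModel's
#    assets folder, where Spark NLP expects to find it
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(f"./{MODEL_NAME}_tokenizer")
shutil.copyfile(f"./{MODEL_NAME}_tokenizer/vocab.txt",
                f"./{MODEL_NAME}/saved_model/1/assets/vocab.txt")

# 3. Import the SavedModel into the equivalent Spark NLP annotator and
#    save it as a regular Spark NLP model for reuse
distilbert = DistilBertEmbeddings.loadSavedModel(
        f"./{MODEL_NAME}/saved_model/1", spark) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True) \
    .setDimension(768) \
    .setStorageRef("distilbert_base_cased")

distilbert.write().overwrite().save(f"./{MODEL_NAME}_spark_nlp")
```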

     

    The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

     

    Backward compatibility

We have updated the MarianTransformer annotator to be compatible with TF v2 models. This change is not backward compatible with previous models/pipelines; however, we have updated and re-uploaded all MarianTransformer models and pipelines for the 3.1.x release. You can either call MarianTransformer.pretrained(MODEL_NAME), which automatically downloads a compatible model, or visit Models Hub to download compatible models for offline use via MarianTransformer.load(PATH).
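
In code, the two options look like this (opus_mt_en_fr stands in for any MarianTransformer model name):

```python
from sparknlp.annotator import MarianTransformer

# Online: pretrained() automatically downloads a 3.1.x-compatible model
marian = MarianTransformer.pretrained("opus_mt_en_fr", "xx")

# Offline: load() a compatible model downloaded from Models Hub
# and unpacked locally
marian_offline = MarianTransformer.load("/path/to/opus_mt_en_fr")
```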

To get started and learn more, visit Models Hub and the Spark NLP documentation.

Please put it to good use!


About the author:
Maziyar Panahi is a Senior Data Scientist and Spark NLP Lead at John Snow Labs with over a decade of experience in public research. He is a senior Big Data engineer and a Cloud architect with extensive experience in computer networks and software engineering, and has been developing software and planning networks for the last 15 years. Earlier in his career, he worked as a network engineer in high-level positions after completing his Microsoft and Cisco training (MCSE, MCSA, and CCNA).
