Redesign of embedding models
Recent developments in NLP rely on vector representations of text, commonly known as embeddings. To support the utilization, training, and fine-tuning of models for the legal domain, Legal NLP is introducing new models: word embedding models, which generate vector representations of words (or tokens), and sentence embedding models, which create vector representations for longer pieces of text such as sentences, paragraphs, and documents.
Word Embedding model
For word embedding models, we aimed to create fast and lightweight models that can be used on token classification tasks such as Named Entity Recognition (NER). The newly released model produces 200-dimensional vectors and was trained from scratch on a mix of legal and general texts, so it learns common English vocabulary while giving special attention to domain-specific terms.
The architecture is based on Word2Vec, which provides the best tradeoff between representation capabilities and speed, allowing for fast and accurate models built on top of it (e.g., NER, Relation Extraction).
To use the model in Legal NLP, use the pretrained method of the corresponding annotator:
# Load the pretrained legal word embeddings and wire them to the sentence and token columns
word_embedding_model = (
    nlp.WordEmbeddingsModel.pretrained(
        "legal_word_embeddings", "en", "legal/models"
    )
    .setInputCols(["sentence", "token"])
    .setOutputCol("word_embedding")
)
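To make the idea of "one vector per token" concrete, here is a minimal, library-free sketch of what a word embedding lookup does conceptually. It is not the Legal NLP implementation: the vocabulary, the vector values, the lowercasing step, and the zero-vector fallback for unknown tokens are all assumptions chosen for illustration, and the toy dimension is 4 rather than 200.

```python
# Toy dimension for readability; the released model uses 200 dimensions.
EMBEDDING_DIM = 4

# Hypothetical vocabulary with made-up vectors (not taken from any real model).
toy_vectors = {
    "lease": [0.2, -0.1, 0.7, 0.0],
    "tenant": [0.3, 0.0, 0.6, 0.1],
    "the": [0.0, 0.1, 0.0, 0.0],
}


def embed_tokens(tokens, vectors, dim=EMBEDDING_DIM):
    """Return one vector per token, in the same order as the input tokens.

    Assumption: lookups are case-insensitive and out-of-vocabulary tokens
    fall back to an all-zero vector.
    """
    zero = [0.0] * dim
    return [list(vectors.get(token.lower(), zero)) for token in tokens]


tokens = ["The", "tenant", "signed", "the", "lease"]
token_vectors = embed_tokens(tokens, toy_vectors)

for token, vector in zip(tokens, token_vectors):
    print(token, vector)
```

A token classifier such as an NER model then consumes this sequence of vectors, one per token, instead of the raw strings.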
With the new word embedding model, we trained new models and improved existing models:
- De-identification model: legner_deid_le
- NER models: legner_subpoenas_sm, legner_sec_edgar_le, legner_contract_doc_parties_le
On average, these models improved by 12% over their previous versions with the new word embedding model, although the contract doc parties model showed a small reduction in its performance metrics.
Sentence Embedding model
For the sentence embedding model, we used the BGE architecture and trained from scratch on our own augmented legal datasets together with public general English datasets. The model maps sentences, paragraphs, and documents to a 768-dimensional vector and is designed to improve Retrieval-Augmented Generation (RAG) applications.
To use the model, use the corresponding annotator:
sentence_embeddings = (
    nlp.BGEEmbeddings.pretrained(
        "legal_bge_base_embeddings", "en", "legal/models"
    )
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)
With better sentence embedding models, legal practitioners, researchers, and developers can design better text classification, entity linking, and RAG systems.
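As an illustration of why sentence embeddings matter for RAG, the sketch below ranks a few passages against a query by cosine similarity, which is a common way to do the retrieval step. It is a simplified, pure-Python example and not part of Legal NLP: the passage names and the 4-dimensional vectors are hypothetical stand-ins for the 768-dimensional vectors a real embedding model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage ids ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), passage_id)
        for passage_id, vec in passage_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage_id for _, passage_id in scored]


# Hypothetical toy embeddings; real values would come from the embedding model.
passages = {
    "termination_clause": [0.9, 0.1, 0.0, 0.2],
    "governing_law_clause": [0.1, 0.8, 0.3, 0.0],
    "confidentiality_clause": [0.0, 0.2, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1, 0.1]  # e.g. an embedded question about ending a contract

ranking = rank_passages(query, passages)
print(ranking)
```

In a RAG system, the top-ranked passages are passed to the generator as context, so the quality of the embeddings directly affects which text the generator sees.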
New multilabel models
Using the previously released E5 embedding model fine-tuned on legal datasets, we trained improved versions of the following models:
- Contracts clause classification: legmulticlf_edgar_le
- Agreement clause classification: legmulticlf_sec_mnda_le
These models improved on the previous versions by up to 16%, demonstrating the better capacity of the E5 model for classification tasks.
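For readers new to multilabel classification, the key difference from single-label classification is that each label is decided independently, so a clause can receive several labels or none. The sketch below shows one common way to do this, with a per-label sigmoid and a threshold. The label names, scores, and the 0.5 threshold are hypothetical, and this is not a description of how the models above are implemented internally.

```python
import math


def sigmoid(x):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Keep every label whose independent sigmoid probability reaches the threshold."""
    return sorted(
        label for label, score in logits.items() if sigmoid(score) >= threshold
    )


# Hypothetical raw scores a classifier head might produce for a single clause.
clause_logits = {
    "confidentiality": 2.1,
    "termination": 0.4,
    "governing_law": -1.7,
}

labels = predict_labels(clause_logits)
print(labels)
```

Raising the threshold trades recall for precision, which is often worth tuning per use case.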
Conclusion
Legal NLP 2.0 makes it easier to train specialized models on top of performant embedding models, and practitioners and data scientists in this area can benefit from them. All models can be found in the Spark NLP Models Hub.
We plan to release more specialized models in future versions, so stay tuned.
Fancy trying?
We’ve got 30-day free licenses for you, with technical support from our team of technical experts and legal subject-matter experts (SMEs). This trial includes complete access to more than 926 models, including Classification, NER, Relation Extraction, Similarity Search, Summarization, Sentiment Analysis, Question Answering, etc., and 120+ legal language models.
Just go to https://www.johnsnowlabs.com/install/ and follow the instructions!
Don’t forget to check our notebooks and demos.
How to run
Legal NLP is easy to run on both clusters and driver-only environments using the johnsnowlabs library.
Install the johnsnowlabs library:
pip install johnsnowlabs
Then, in Python, install Legal NLP with:
from johnsnowlabs import nlp
nlp.install(force_browser=True)
You are now ready to use all the capabilities of the library. Start a Spark session and begin analyzing legal texts:
# Start Spark Session
spark = nlp.start()
For alternative installation methods and instructions for specific environments, please check the docs.