Bringing fresh capabilities for Legal research and applications with Legal NLP

01.07.2024

David Cecchini

Data Scientist at John Snow Labs

Redesign of embedding models

Recent developments in NLP rely on vector representations of text, commonly known as embeddings. To support the utilization, training, and fine-tuning of models for the legal domain, Legal NLP is introducing new models: word embedding models, which generate vector representations of words (or tokens), and sentence embedding models, which create vector representations for longer pieces of text such as sentences, paragraphs, and documents.
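To make the distinction concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors are invented for illustration and do not come from any Legal NLP model, and averaging token vectors is only a naive baseline for building a sentence vector, not how the models described in this post work.

```python
# Toy illustration only: these 3-dimensional vectors are made up,
# not produced by any Legal NLP model.
toy_word_vectors = {
    "the": [0.1, 0.0, 0.2],
    "tenant": [0.9, 0.3, 0.1],
    "shall": [0.2, 0.8, 0.1],
    "pay": [0.3, 0.7, 0.5],
    "rent": [0.8, 0.2, 0.6],
}


def embed_tokens(tokens):
    """Word-level view: one vector per known token."""
    return [toy_word_vectors[t] for t in tokens if t in toy_word_vectors]


def embed_sentence(tokens):
    """Sentence-level view: a single vector (here, the mean of the token vectors)."""
    vectors = embed_tokens(tokens)
    if not vectors:
        raise ValueError("no known tokens to embed")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


tokens = "the tenant shall pay rent".split()
word_level = embed_tokens(tokens)          # five vectors, one per token
sentence_level = embed_sentence(tokens)    # one vector for the whole sentence
print(len(word_level), "token vectors;", len(sentence_level), "numbers in the sentence vector")
```

Token classification tasks such as NER consume the per-token vectors, while classification and retrieval over whole passages consume the single sentence-level vector.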

Word Embedding model

For word embedding models, we aimed to create fast and lightweight models that can be used on token classification tasks such as Named Entity Recognition (NER). The newly released model produces 200-dimensional vectors and was trained from scratch on a mix of legal and general texts, so it covers common English words while placing a special focus on domain-specific terms.

The architecture is based on Word2Vec, which offers a strong trade-off between representation quality and speed, so that models built on top of it (e.g., NER, Relation Extraction) can be both fast and accurate.

To use the model in Legal NLP, use the pretrained method of the corresponding annotator:

# Pretrained word embeddings are loaded through the model annotator
word_embedding_model = (
    nlp.WordEmbeddingsModel.pretrained(
        "legal_word_embeddings", "en", "legal/models"
    )
    .setInputCols(["sentence", "token"])
    .setOutputCol("word_embedding")
)

With the new word embedding model, we trained new models and improved existing models:

De-identification model: legner_deid_le
NER models: legner_subpoenas_sm, legner_sec_edgar_le, legner_contract_doc_parties_le

On average, these models improved by 12% over their previous versions when using the new word embedding model. The exception was the contract doc parties model, whose performance metrics dropped slightly.

Sentence Embedding model

The sentence embedding model is based on the BGE architecture and was trained from scratch on our own augmented legal datasets together with public general English datasets. It maps sentences, paragraphs, and documents to 768-dimensional vectors and is designed to improve RAG applications.

To use the model, use the corresponding annotator:

sentence_embeddings = (
    nlp.BGEEmbeddings.pretrained(
        "legal_bge_base_embeddings", "en", "legal/models"
    )
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")
)

With better sentence embedding models, legal practitioners, researchers, and developers can design better text classification, entity linking, and RAG systems.
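As a rough illustration of how sentence embeddings feed the retrieval step of a RAG system, the sketch below ranks passages by cosine similarity to a query vector. The passage titles and the 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional output of the model, and the `retrieve` helper is an assumption for this example, not part of Legal NLP.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, passages, top_k=2):
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical passages with made-up embeddings (real ones would have 768 values).
passages = [
    ("Termination clause", [0.9, 0.1, 0.0]),
    ("Governing law clause", [0.1, 0.9, 0.1]),
    ("Notice of termination", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded user question

top = retrieve(query_vec, passages)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

In a full RAG setup, the retrieved passages would then be passed to a generative model as context. A vector database would normally replace the linear scan used here.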

New multilabel models

Using the previously released E5 embedding model, fine-tuned on legal datasets, we trained improved versions of the following models:

Contracts clause classification: legmulticlf_edgar_le
Agreement clause classification: legmulticlf_sec_mnda_le

These models improved on the previous versions by up to 16%, demonstrating the better capacity of the E5 model for classification tasks.
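For readers new to multilabel classification: unlike a multiclass model, a multilabel classifier may assign several labels to the same passage. The sketch below shows one common way to do this, by scoring each label independently and keeping those above a threshold. The label names, scores, and the 0.5 threshold are hypothetical, and whether the released models use exactly this decision rule is an assumption.

```python
import math


def sigmoid(x):
    """Map a raw score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Keep every label whose independent probability reaches the threshold."""
    return sorted(label for label, z in logits.items() if sigmoid(z) >= threshold)


# Hypothetical raw scores for one contract passage.
clause_logits = {"confidentiality": 2.1, "governing_law": -1.3, "termination": 0.4}

predicted = predict_labels(clause_logits)
print(predicted)
```

Because each label is decided on its own, a passage can receive zero, one, or several clause types.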

Conclusion

Legal NLP 2.0 makes it easier to train specialized models on top of performant embedding models, which benefits practitioners and data scientists working in the legal domain. All models can be found in the Spark NLP Models Hub.

We plan to release new specialized models in future releases, so stay tuned.

Fancy trying?

We’ve got 30-day free licenses for you, with technical support from our team of technical and legal subject-matter experts (SMEs). This trial includes complete access to more than 926 models, including Classification, NER, Relation Extraction, Similarity Search, Summarization, Sentiment Analysis, Question Answering, etc., and 120+ legal language models.

Just go to https://www.johnsnowlabs.com/install/ and follow the instructions!

Don’t forget to check our notebooks and demos.

How to run

Legal NLP is easy to run on both clusters and driver-only environments using the johnsnowlabs library:

Install the johnsnowlabs library:

pip install johnsnowlabs

Then, in Python, install Legal NLP with:

from johnsnowlabs import nlp

nlp.install(force_browser=True)

You are now ready to use all the capabilities of the library. Start a Spark session and begin analyzing legal texts:

# Start Spark Session
spark = nlp.start()

For alternative installation methods and instructions for specific environments, please check the docs.
