Copyright © Doctrine
Meetup NLP #2 season 4
Structuring legal documents
with Deep Learning
Pauline Chavallard
2019/11/27
Plan
● About Doctrine and legal research
● Motivations
● Modeling
● Results
● Further work
Google for law
Doctrine was created in 2016
Challenges
- volume of data
- heterogeneity
- domain specificity
Legal contents have tons of links
Challenges in data science at Doctrine
Low/weak supervision:
● No labeled data (esp. in French)
High specificity/heterogeneity:
● Language is different between decisions, legislations and commentaries
● Among decisions, depending on courts, structures are different
● Content comes in various formats (papers, images, PDFs, texts)
An example of French court decision
Plan
● About Doctrine and legal research
● Motivations
● Modeling
● Results
● Further work
Motivation
● Four million court decisions handed down each year in France
● Critical information for lawyers
Problem:
● Long and complex documents
● One may be interested in only a very specific part
French court decisions
A French court decision is generally structured into the following sections:
● Metadata (« En-tête » in French): court, number, date, etc., of the trial.
● Parties (« Parties » in French): information about the claimants and defendants
● Composition of the court (« Composition de la cour » in French)
● Facts (« Faits » in French): what happened?
● Pleas in law and main arguments (« Moyens » in French): arguments presented by
the claimant and defendant.
● Grounds (« Motifs » in French): reasons and arguments used by the court
● Operative part of the judgment (« Dispositif » in French): final decision
French court decisions - Example
Cour d'appel de Metz, 28 janvier 2015
French court decisions
Unfortunately, there is no mandatory guideline on how to
release a court decision.
Courts may use:
● different styles in terms of writing
● different styles in terms of organising the document
● all sections from the previous slide, or only a subset
The French Court of Appeal usually has a very unified way of
writing: ~55 % have explicit titles for their categories
French Court of Appeal
Extracted from https://www.doctrine.fr/d/CA/Orleans/2007/SKDD824CCFE8D8D9D93128.
French Court of Appeal (Facts section)
Extracted from https://www.doctrine.fr/d/CA/Orleans/2007/SKDD824CCFE8D8D9D93128.
For the remaining 45 %, it’s harder...
French Court of Appeal
Extracted from https://www.doctrine.fr/d/CA/Metz/2015/RAC1261A1563690C06B77
How would an algorithm automatically generate a table of contents?
Plan
● About Doctrine and legal research
● Motivations
● Modeling
● Results
● Further work
Information needed
To complete this task, a human being would take advantage of:
1. The vocabulary used
2. The order of the paragraphs
Information needed
1. The vocabulary used
Not always so obvious: legislation references appear in both sections...
-> standard encoding approaches (BoW, TF-IDF) performed poorly
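Why shared legal vocabulary hurts these baselines can be illustrated with a tiny hand-rolled TF-IDF sketch (hypothetical paragraphs and a simplified weighting, not Doctrine's pipeline): a legislation reference appearing in every section type gets an IDF of zero, so it carries no discriminative weight.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute sparse TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

# Two hypothetical paragraphs: a "Moyens" and a "Motifs" paragraph, both
# citing the same legislation reference (article 1240 du code civil).
moyens = "le demandeur invoque l' article 1240 du code civil".split()
motifs = "la cour retient que l' article 1240 du code civil s' applique".split()
vecs = tfidf([moyens, motifs])
# "article" appears in both documents, so its IDF (and weight) is exactly 0.
```

The shared reference is precisely the kind of token a human would use differently depending on context, which is what the sequential model below can exploit.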
Information needed
2. The order of the paragraphs
● Metadata
● Parties
● Composition of the court
● Facts
● Pleas in law and main arguments
● Grounds
● Operative part of the judgment
-> sequential information is important
Modeling
Split decisions into paragraphs (X)
Pre-process: replace rare words by <UNK> with probability p = 0.5
Dataset creation
● Find labeled data from structured decisions with titles
● Remove titles
● Assign each paragraph to its corresponding label (y)
● y ∈ {0, 1, …, 6} (seven classes)
-> Supervised classification
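The dataset-creation step above can be sketched as follows (the title regexes and label ids are illustrative assumptions, not the production rules): detect explicit section titles in a structured decision, drop them, and label every following paragraph with the current section id.

```python
import re

# Hypothetical title patterns, one per section, in the slide's order (0..6).
SECTION_PATTERNS = [
    (0, r"en[\s-]?t[eê]te"),            # Metadata
    (1, r"parties"),
    (2, r"composition de la cour"),
    (3, r"faits"),
    (4, r"moyens"),
    (5, r"motifs"),
    (6, r"dispositif"),
]

def build_examples(paragraphs):
    """Drop explicit titles and label each following paragraph (y in 0..6)."""
    X, y = [], []
    label = None
    for para in paragraphs:
        matched = None
        for lab, pat in SECTION_PATTERNS:
            if re.fullmatch(pat, para.strip().lower()):
                matched = lab
                break
        if matched is not None:
            label = matched             # title line: sets the label, is removed
        elif label is not None:
            X.append(para)
            y.append(label)
        # paragraphs before the first detected title are skipped in this sketch
    return X, y

paras = ["FAITS", "Le 3 mars ...", "MOTIFS", "La cour retient ...",
         "DISPOSITIF", "Condamne ..."]
X, y = build_examples(paras)
# -> X keeps the three body paragraphs, y == [3, 5, 6]
```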
Looks like Named Entity Recognition... at paragraph scale.
With LSTM / CRF, we capture information from
● paragraph inherent properties
● paragraph context (the neighborhood gives insights on the label)
[1] Neural Architectures for Named Entity Recognition. Lample, Ballesteros, Subramanian, Kawakami, Dyer.
NAACL 2016.
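The data flow of such a tagger can be sketched in NumPy (random weights and made-up dimensions, shapes only, not the trained model): each paragraph embedding passes through a bi-LSTM, and the concatenated hidden states produce per-paragraph label scores for the CRF/softmax layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm(X, W, U, b, h0, c0):
    """Single-direction LSTM over X: (T, d_in); returns hidden states (T, d_h)."""
    T, _ = X.shape
    d_h = h0.shape[0]
    h, c = h0, c0
    out = np.zeros((T, d_h))
    for t in range(T):
        z = W @ X[t] + U @ h + b                  # (4*d_h,) pre-activations
        i, f, g, o = np.split(z, 4)               # input, forget, cell, output
        i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        out[t] = h
    return out

T, d_in, d_h, K = 6, 8, 4, 7                      # 6 paragraphs, 7 section labels
X = rng.normal(size=(T, d_in))                    # paragraph embeddings
params = lambda: (rng.normal(size=(4*d_h, d_in)), rng.normal(size=(4*d_h, d_h)),
                  np.zeros(4*d_h), np.zeros(d_h), np.zeros(d_h))
fwd = lstm(X, *params())                          # left-to-right pass
bwd = lstm(X[::-1], *params())[::-1]              # right-to-left pass
H = np.concatenate([fwd, bwd], axis=1)            # (T, 2*d_h) bi-LSTM states
scores = H @ rng.normal(size=(2*d_h, K))          # (T, K) per-paragraph label scores
```

Each row of `scores` is one paragraph's unnormalised label distribution; the CRF layer then decodes the whole sequence jointly.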
Modeling
Modeling: paragraph embedding
source: A structured self-attentive sentence embedding
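A minimal NumPy sketch of the pooling idea (a single-vector simplification of Lin et al.'s self-attentive embedding, with toy vectors rather than learned parameters): score each word representation, softmax the scores, and take the weighted sum as the paragraph embedding.

```python
import numpy as np

def attention_pool(H, w):
    """Self-attentive pooling.
    H: (seq_len, d) word representations; w: (d,) learned scoring vector."""
    scores = H @ w                      # one relevance score per word
    a = np.exp(scores - scores.max())   # numerically stable softmax
    a /= a.sum()                        # attention weights, sum to 1
    return a, a @ H                     # weights and the (d,) pooled embedding

# Toy input: the third "word" aligns strongly with the scoring vector.
H = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 0.0]])
w = np.array([1.0, 0.0])
a, emb = attention_pool(H, w)
# -> a concentrates on index 2, and emb is close to H[2]
```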
Modeling: all in one
end-to-end training
Plan
● About Doctrine and legal research
● Motivations
● Modeling
● Results
● Further work
Modeling: results
● Trained on 20,000 decisions
● bi-LSTM outperforms mean pooling but is more computationally expensive
● CRF outperforms softmax at the same computational cost
● Attention brings a few points of accuracy at low computational cost
The CRF lets us inspect the learned transition probabilities:
● Each class is most often followed by itself
● Metadata -> Parties
● Metadata -> Composition of the court
● Lower triangular part: green
● Upper triangular part: red
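The decoding step that uses this transition matrix can be sketched with a minimal Viterbi decoder (NumPy sketch with illustrative scores, not the trained model's parameters): transitions that stay in place or move forward are favoured, so the jointly decoded path avoids backward jumps even when one paragraph's emission scores slightly prefer an earlier label.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best label sequence. emissions: (T, K) per-paragraph label scores;
    transitions: (K, K) score of moving from label i to label j."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions          # (K, K) all pairs
        back[t] = cand.argmax(axis=0)                # best predecessor per label
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example with 3 ordered labels: staying or moving forward scores well,
# jumping backward is heavily penalised.
trans = np.array([[0.0, 0.0, -1.0],
                  [-5.0, 0.0, 0.0],
                  [-5.0, -5.0, 0.0]])
emis = np.array([[2.0, 0.0, 0.0],
                 [0.0, 1.0, 0.9],
                 [1.1, 0.0, 1.0]])   # last paragraph slightly prefers label 0
path = viterbi(emis, trans)
# -> path == [0, 1, 2], whereas per-paragraph argmax would output [0, 1, 0]
```

This is exactly why CRF beats an independent softmax here: the softmax would jump back to label 0 on the last paragraph, while the CRF's transition scores keep the section order consistent.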
Modeling: attention
Product outcome
For the 45 % of Court of Appeal decisions with an incomplete table of
contents, this approach now produces 90 % complete ones.
Plan
● About Doctrine and legal research
● Motivations
● Modeling
● Results
● Further work
Errors of the model
Further work
- better paragraph / sentence splitting
- one of the tags is very rare and the model performs poorly on it
- play with optimizers, dropout, …
- try different architectures?
A blog post is available:
Paragraph classification, an article by Doctrine
Thank you for your attention!
Any questions?