Mining legal texts with Python

Mining Legal Text: Information Mining and Visualization of a Large Volume of Legal Texts. Flávio Codeço Coelho, Renato Rocha Souza and Pablo de Camargo Cerdeira. Applied Mathematics School – Getulio Vargas Foundation. August 22, 2011.

Upload: flavio-coelho

Post on 30-Aug-2014

DESCRIPTION

Presented at EuroScipy 2011

TRANSCRIPT

Page 1: Mining legal texts with Python

Mining Legal Text

Information Mining and Visualization of a Large Volume of Legal Texts

Flávio Codeço Coelho, Renato Rocha Souza and Pablo de Camargo Cerdeira

Applied Mathematics School – Getulio Vargas Foundation

August 22, 2011

Page 2: Mining legal texts with Python

Outline I

1 Introduction

2 Web-Scraping: HTML Parsing

3 Pattern Matching: Regular expressions

4 Database Interaction: MySQLDb, SQLAlchemy, MongoDb

5 Natural Language Processing: NLTK

6 Visualization: Matplotlib, Ubigraph, Gource

Page 3: Mining legal texts with Python

Outline II

6 Visualization (continued): Visual Python

7 Results

8 Future Directions

Page 4: Mining legal texts with Python

Introduction

Conquering text

Scraping and indexing the world’s web pages has changed the world...

Should PageRank be our main measure of information relevance?

What is possible if we go a little further?

Page 5: Mining legal texts with Python

Introduction

It’s documents all the way down...

Luckily, we didn’t have to scan them...

We have to conquer an information mountain...

Page 6: Mining legal texts with Python

Introduction

We had generous help...

Page 7: Mining legal texts with Python

Web-Scraping

Obtaining the Data

No API for access; some heuristics were necessary.

Scraping took more than 3 months.

1.3 million cases.

Page 8: Mining legal texts with Python

Web-Scraping

Example: Photos

Navigating with Mechanize [1]

    import mechanize
    from mechanize import LinkNotFoundError

    # download_photo(url, name) is the authors' helper; it is not shown on the slide.
    br = mechanize.Browser()
    br.open("http://www.stf.jus.br/portal/ministro/ministro.asp?periodo=stf&tipo=antiguidade")
    i = 0
    link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
    while 1:
        br.follow_link(link)  # open the justice's page
        il = br.find_link(url_regex='imagem.asp')  # link to the photo
        url = "http://www.stf.jus.br/portal" + il.url.strip('..')
        nome = il.text
        # Python 2: decode the Latin-1 link text and keep the part before '['
        download_photo(url, nome.decode('latin1').split('[')[0])
        br.back()
        try:
            i += 1  # advance before the lookup so the first link is not visited twice
            link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
        except LinkNotFoundError:
            break

[1] http://wwwsearch.sourceforge.net/mechanize/

Page 9: Mining legal texts with Python

Web-Scraping

HTML Parsing

Parsing scraped HTML

Beautiful Soup [2] to the rescue!

Firebug helped analyze page structure.

Parsing was done during the scraping, to clean data for insertion into MySQL.

Some parts of the page were stored as HTML for later parsing.

    import re
    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 API
    sopa = BeautifulSoup(d['decisao'].strip('[]'), fromEncoding='ISO8859-1')
    rs = sopa.findAll('strong', text=re.compile('^Legisla'))

[2] http://www.crummy.com/software/BeautifulSoup/
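The snippet on this slide relies on the Beautiful Soup 3 API. As a rough, dependency-free illustration of the same idea (collecting the text of every <strong> element whose text starts with "Legisla"), here is a sketch that swaps in Python 3's standard html.parser module for Beautiful Soup. The class name StrongTextFinder and the sample HTML are made up for illustration and are not from the talk.

```python
import re
from html.parser import HTMLParser


class StrongTextFinder(HTMLParser):
    """Collect the text of <strong> elements whose text matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.depth = 0      # > 0 while inside a <strong> element
        self.buffer = []    # text pieces of the current <strong>
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.buffer).strip()
                self.buffer = []
                if self.pattern.match(text):
                    self.matches.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


# Hypothetical fragment standing in for a scraped decision page.
sample_html = (
    "<p><strong>Legislação</strong> CF art. 5</p>"
    "<p><strong>Observação</strong> none</p>"
)

finder = StrongTextFinder(r"^Legisla")
finder.feed(sample_html)
finder.close()
print(finder.matches)
```

Only the first <strong> element is kept, because the second one does not start with "Legisla".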

Page 10: Mining legal texts with Python

Pattern Matching

Extracting Even More Information

With data on a local db, we started mining it:

Tried to use the best SQL and Python had to offer

Pattern matching, aggregation, string matching [3], etc.

Read from Db → Process → Write to Db

SQL → Python → SQL

[3] difflib
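The "Read from Db → Process → Write to Db" loop can be sketched as follows. This is a minimal illustration, not the authors' pipeline: an in-memory sqlite3 database stands in for MySQL so the example runs anywhere, the table and column names are hypothetical, and the processing step uses difflib.get_close_matches to map misspelled labels onto a made-up list of canonical ones.

```python
import difflib
import sqlite3

# Canonical spellings to match against (hypothetical sample values).
CANONICAL = ["RECURSO EXTRAORDINARIO", "HABEAS CORPUS", "AGRAVO DE INSTRUMENTO"]


def normalize_label(raw, choices=CANONICAL, cutoff=0.8):
    """Return the closest canonical label, or None if nothing is close enough."""
    hits = difflib.get_close_matches(raw.upper().strip(), choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute("CREATE TABLE decisao (id INTEGER PRIMARY KEY, classe_raw TEXT, classe TEXT)")
conn.executemany(
    "INSERT INTO decisao (classe_raw) VALUES (?)",
    [("Habeas Corpvs",), ("recurso extraordinario ",), ("???",)],
)

# Read from Db -> Process -> Write to Db
rows = conn.execute("SELECT id, classe_raw FROM decisao").fetchall()
updates = [(normalize_label(raw), row_id) for row_id, raw in rows]
conn.executemany("UPDATE decisao SET classe = ? WHERE id = ?", updates)
conn.commit()

result = conn.execute("SELECT classe_raw, classe FROM decisao ORDER BY id").fetchall()
for raw, clean in result:
    print(repr(raw), "->", clean)
```

Rows with no sufficiently close match are written back as NULL, which keeps them easy to find for manual review.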

Page 11: Mining legal texts with Python

Pattern Matching

Regular Expressions

re module: great, but tricky for different encodings.

Kodos [a]: visual debugging indispensable!

[a] http://kodos.sourceforge.net/

    import re
    # Python 2 code; in Python 3, re.LOCALE is only accepted with bytes patterns
    rawstr = r""">*\s*([A-Z]{2,3}\s*-\s*.[A-Z0-9]*)|(CF)|("CAPUT")\s+"""
    compile_obj = re.compile(rawstr, re.LOCALE)
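To show how a pattern of this kind can be applied, here is a simplified Python 3 sketch. It is an interpretation, not the authors' expression: it drops the leading `>*\s*`, the `.` wildcard and the trailing `\s+`, uses named groups for readability, and runs on a made-up sample string.

```python
import re

# Verbose restatement of the slide's pattern (an interpretation, not the authors' code).
LEGISLATION = re.compile(
    r"""
    (?P<code>[A-Z]{2,3}\s*-\s*[A-Z0-9]+)   # codes such as LEG-FED or ART-00005
    | (?P<cf>\bCF\b)                       # reference to the Federal Constitution
    | (?P<caput>"CAPUT")                   # quoted CAPUT marker
    """,
    re.VERBOSE,
)

sample = 'LEG-FED CF ANO-1988 ART-00005 "CAPUT"'  # hypothetical input

tokens = [(m.lastgroup, m.group()) for m in LEGISLATION.finditer(sample)]
for kind, text in tokens:
    print(kind, text)
```

Using `m.lastgroup` tells which alternative matched, which is easier to maintain than checking numbered groups one by one.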

Page 12: Mining legal texts with Python

Database Interaction

Structuring the Data

Goals:

Reflect the original structure of the data

Store additional info coming from raw text

Design data model with future analytical needs in mind

Page 13: Mining legal texts with Python

Database Interaction

MySQLDb

Databases and Drivers

MySQL (MariaDB [4]) was the relational database of choice.

MySQLDb’s cursor.execute('select * from ...')

Server-side cursors were essential.

MongoDb + PyMongo

[4] http://mariadb.org
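Server-side cursors matter because MySQLdb's default cursor buffers the whole result set in client memory, which is impractical for millions of rows. The sketch below shows the general DB-API pattern of consuming a result in batches; sqlite3 stands in for MySQL so that the example is runnable, and the table and the helper iter_rows are hypothetical. With MySQLdb, the unbuffered behaviour would come from a cursor class such as MySQLdb.cursors.SSCursor.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time, fetching them from the cursor in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row


conn = sqlite3.connect(":memory:")  # stand-in; the talk used MySQL
conn.execute("CREATE TABLE decisao (id INTEGER PRIMARY KEY, duracao INTEGER)")
conn.executemany(
    "INSERT INTO decisao (duracao) VALUES (?)",
    [(n % 7,) for n in range(10000)],
)

cur = conn.cursor()
cur.execute("SELECT id, duracao FROM decisao")

total = 0
count = 0
for _id, duracao in iter_rows(cur, batch_size=500):
    total += duracao
    count += 1

print(count, total)
```

The aggregation never holds more than one batch of rows in Python at a time.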

Page 14: Mining legal texts with Python

Database Interaction

SQLAlchemy

What about ORMs?

Object-relational mappers are great but...

SQLAlchemy [5] used mostly in table creation and data insertion.

For analytical purposes, server-side raw SQL, stored procs and views can’t be beaten.

We mostly used Elixir to design the tables.

[5] http://www.sqlalchemy.org

Page 15: Mining legal texts with Python

Database Interaction

MongoDb

Escaping from 2D data

Benefits:

Exploring MongoDb [a] as an alternative for analytics

Auto-sharding + Map/reduce!

Escape costly joins in MySQL

Tips:

db.cursor(cursorclass=SSDictCursor)

Convert every string to UTF-8

Pymongo’s transparent conversion of dictionaries to BSON

[a] www.mongodb.org
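The "convert every string to UTF-8" tip can be sketched as below. This is a Python 3 illustration under stated assumptions: the source rows are assumed to hold Latin-1 encoded bytes (the scraped pages were ISO-8859-1), and the helper to_text and the sample row are hypothetical. Decoding everything to str first means each value can then be encoded as UTF-8, which is the string encoding BSON uses.

```python
def to_text(value, source_encoding="latin-1"):
    """Recursively decode bytes inside dicts, lists and tuples into str."""
    if isinstance(value, bytes):
        return value.decode(source_encoding)
    if isinstance(value, dict):
        return {
            to_text(k, source_encoding): to_text(v, source_encoding)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [to_text(item, source_encoding) for item in value]
    return value


# Hypothetical row as it might come back from a Latin-1 encoded source.
row = {
    "relator": "MIN. SYDNEY SANCHES".encode("latin-1"),
    "tipo": "Monocrática".encode("latin-1"),
    "tags": [b"INQU\xc9RITO", 1606809],
}

doc = to_text(row)
print(doc)

# Every string in doc can now be encoded as UTF-8.
utf8_bytes = doc["tipo"].encode("utf-8")
```

Non-string values such as the integer case number pass through unchanged.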

Page 16: Mining legal texts with Python

Natural Language Processing

Understanding Text

Biggest challenge is extracting meaning from decisions

Is a given decision pro- or against the defendant?

What is the vote count on non-unanimous decisions?

Page 17: Mining legal texts with Python

Natural Language Processing

NLTK

Natural Language Toolkit

Lots of batteries included

Page 18: Mining legal texts with Python

Visualization

Visualizing the Data

You can’t ask questions about what you don’t know...

Data driven research

Page 19: Mining legal texts with Python

Visualization

Matplotlib

Standard Charting and Plotting: Matplotlib

Great for plotting summary statistics

Together with NetworkX, it can help visualize some small graphs

Page 20: Mining legal texts with Python

Visualization

Ubigraph

Large Graph Visualization: Ubigraph

Ubigraph Rocks! [a]

Navigating huge graphs gave powerful insights

Takes advantage of multiple cores and GPU

[a] http://ubietylab.net/ubigraph/

Page 21: Mining legal texts with Python

Visualization

Gource

Untangling Temporal Patterns

A bit of Python to create logs compatible with Gource [6]

This:

    import time
    from matplotlib import cm
    from matplotlib.colors import Normalize, rgb2hex  # Normalize was `normalize` in 2011

    # dbdec (database connection) and ano (year) are defined elsewhere
    Q = dbdec.execute(
        "SELECT relator, processo, tipo, proc_classe, duracao, UF, data_dec "
        "FROM decisao WHERE DATE_FORMAT(data_dec, '%Y%')=" + "%s" % ano +
        " ORDER BY data_dec asc")
    decs = Q.fetchall()
    durations = [d[4] for d in decs]
    cmap = cm.jet
    norm = Normalize(min(durations), max(durations))  # normalizing durations
    with open('decisoes_%s.log' % ano, 'w') as f:
        for d in decs:
            c = rgb2hex(cmap(norm(d[4]))[:3]).strip('#')
            path = "/%s/%s/%s/%s" % (d[5], d[2], d[3], d[1])  # /State/tipo/proc_classe/processo
            l = "%s|%s|%s|%s|%s\n" % (int(time.mktime(d[6].timetuple())), d[0], 'A', path, c)
            f.write(l)

Generates this:

    885967200|MIN. SYDNEY SANCHES |A|/MG/Monocrática/INQUÉRITO/1606809|0000a4
    885967200|MIN. SYDNEY SANCHES |A|/MG/Presidência/INQUÉRITO/1606809|0000a4

[6] http://code.google.com/p/gource/
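For readers without the database or Matplotlib at hand, the log format itself can be illustrated with the standard library alone. Gource's custom log format is one entry per line, `timestamp|user|type|path|color`, in chronological order. In this sketch the function names, the sample decisions and the simple blue-to-red color ramp (standing in for the `jet` colormap) are all hypothetical.

```python
import time
from datetime import datetime


def duration_color(duration, lo, hi):
    """Map a duration onto a blue-to-red hex color (stand-in for a colormap)."""
    frac = 0.0 if hi == lo else (duration - lo) / float(hi - lo)
    red = int(round(255 * frac))
    blue = 255 - red
    return "%02x%02x%02x" % (red, 0, blue)


def gource_line(when, user, path, color, action="A"):
    """Format one entry of Gource's custom log: timestamp|user|type|path|color."""
    timestamp = int(time.mktime(when.timetuple()))
    return "%d|%s|%s|%s|%s" % (timestamp, user, action, path, color)


# Hypothetical decisions: (date, judge, path, duration in days).
decisions = [
    (datetime(1998, 3, 2), "MIN. B", "/SP/Plenário/HABEAS CORPUS/2", 400),
    (datetime(1998, 1, 28), "MIN. A", "/MG/Monocrática/INQUÉRITO/1", 10),
]
lo = min(d[3] for d in decisions)
hi = max(d[3] for d in decisions)

# Gource expects entries sorted by time, so sort before formatting.
lines = [
    gource_line(d[0], d[1], d[2], duration_color(d[3], lo, hi))
    for d in sorted(decisions, key=lambda d: d[0])
]
print("\n".join(lines))
```

The shortest case is colored pure blue and the longest pure red, so slow cases stand out in the animation.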

Page 22: Mining legal texts with Python

Visualization

Gource

A snapshot of the Supreme Court activities: 1998

Page 23: Mining legal texts with Python

Visualization

Gource

The Dynamics

Video

Page 24: Mining legal texts with Python

Visualization

Visual Python

It’s a Jungle Out There...

Division of labor in the Supreme Court

VPython [a] is great to quickly create complex animations.

Here judges are trees, branches are subjects and leaves are legal decisions

[a] vpython.org

Page 25: Mining legal texts with Python

Results

Detailed X-ray of the inner workings of the Supreme Court

92% of the cases are appeals of a non-constitutional nature

These results led to the proposal of an amendment to the constitution!

More questions than answers!

Python for data mining rocks!

Page 26: Mining legal texts with Python

Future Directions

To be continued...

Further automate and optimize

More explorations

Scale up the pipeline

Model the life history of a legal process

Page 27: Mining legal texts with Python

Future Directions

Acknowledgements

FGV - Direito Rio

FGV - EMAp

Brazilian Supreme Court

Asla Sá (for kindly lending us her server)