Mining Legal Texts with Python
Presented at EuroScipy 2011
Mining Legal Text
Information Mining and Visualization of a Large Volume of Legal Texts
Flávio Codeço Coelho, Renato Rocha Souza and Pablo de Camargo Cerdeira
Applied Mathematics School – Getulio Vargas Foundation
August 22, 2011
Outline
1 Introduction
2 Web-Scraping
    HTML Parsing
3 Pattern Matching
    Regular expressions
4 Database Interaction
    MySQLDb
    SQLAlchemy
    MongoDb
5 Natural Language Processing
    NLTK
6 Visualization
    Matplotlib
    Ubigraph
    Gource
    Visual Python
7 Results
8 Future Directions
Introduction
Conquering text
Scraping and indexing the world's web pages has changed the world...
Should PageRank be our main measure of information relevance?
What is possible if we go a little further?
It’s documents all the way down...
Luckily, we didn't have to scan them...
We have to conquer an information mountain...
We had generous help...
Web-Scraping
Obtaining the Data
No API for access; some heuristics were necessary
Scraping took more than 3 months
1.3 million cases
Example: Photos
Navigating with Mechanize [1]
    import mechanize
    from mechanize import LinkNotFoundError

    # download_photo(url, name) is a helper that is not shown on the slide
    br = mechanize.Browser()
    br.open("http://www.stf.jus.br/portal/ministro/ministro.asp?periodo=stf&tipo=antiguidade")
    i = 0
    link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
    while True:
        br.follow_link(link)  # open the next minister's page
        il = br.find_link(url_regex='imagem.asp')  # link to the photo
        url = "http://www.stf.jus.br/portal" + il.url.strip('..')
        nome = il.text
        download_photo(url, nome.decode('latin1').split('[')[0])
        br.back()
        i += 1  # advance before the lookup so no page is visited twice
        try:
            link = br.find_link(url_regex=r'verMinistro.asp', nr=i)
        except LinkNotFoundError:
            break
[1] http://wwwsearch.sourceforge.net/mechanize/
HTML Parsing
Parsing scraped HTML
Beautiful Soup [2] to the rescue!
Firebug helped analyze page structure.
Parsing was done during the scraping, to clean the data for insertion into MySQL.
Some parts of the page were stored as HTML for later parsing.
    import re
    from BeautifulSoup import BeautifulSoup
    sopa = BeautifulSoup(d['decisao'].strip('[]'), fromEncoding='ISO8859-1')  # d: one scraped record
    rs = sopa.findAll('strong', text=re.compile('^Legisla'))  # text starting with "Legisla"
[2] http://www.crummy.com/software/BeautifulSoup/
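The same lookup can be sketched with only the standard library. This is a minimal illustration, not the code used in the project: html.parser stands in for Beautiful Soup, and the class name StrongTextFinder and the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser


class StrongTextFinder(HTMLParser):
    """Collect the text of <strong> elements that matches a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.depth = 0      # > 0 while inside a <strong> element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and self.pattern.search(text):
            self.matches.append(text)


# Hypothetical fragment, encoded in ISO-8859-1 like the scraped pages.
raw = "<p><strong>Legislação</strong> CF ART-5</p><strong>Outro</strong>".encode("iso8859-1")

finder = StrongTextFinder(r"^Legisla")
finder.feed(raw.decode("iso8859-1"))  # decode before parsing
finder.close()
print(finder.matches)
```

Decoding the bytes before feeding the parser keeps accented characters intact, which is the same concern the fromEncoding argument addresses above.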
Pattern Matching
Extracting Even more Information
With the data in a local database, we started mining it:
Tried to use the best that SQL and Python had to offer
Pattern matching, aggregation, string matching [3], etc.
Read from Db → Process → Write to Db
SQL → Python → SQL
[3] difflib
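The Read → Process → Write loop can be sketched as follows. This is an illustration under stated assumptions: sqlite3 stands in for the MySQL server, the list of canonical names and the misspelled inputs are hypothetical, and only the table and column names (decisao, relator) are taken from the slides.

```python
import difflib
import sqlite3

# Hypothetical canonical spellings; real ones would come from the database.
CANONICAL = ["SYDNEY SANCHES", "MARIA EXEMPLO"]


def normalize(name, cutoff=0.8):
    """Return the closest canonical spelling, or the input if none is close."""
    hits = difflib.get_close_matches(name.upper(), CANONICAL, n=1, cutoff=cutoff)
    return hits[0] if hits else name


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL server
conn.execute("CREATE TABLE decisao (id INTEGER PRIMARY KEY, relator TEXT)")
conn.executemany(
    "INSERT INTO decisao (relator) VALUES (?)",
    [("Sydney Sanchez",), ("MARIA EXENPLO",), ("UNKNOWN",)],
)

# Read from Db -> process in Python -> write back to Db.
rows = conn.execute("SELECT id, relator FROM decisao").fetchall()
updates = [(normalize(relator), row_id) for row_id, relator in rows]
conn.executemany("UPDATE decisao SET relator = ? WHERE id = ?", updates)
conn.commit()

cleaned = [r[0] for r in conn.execute("SELECT relator FROM decisao ORDER BY id")]
print(cleaned)
```

The cutoff of 0.8 is an arbitrary choice for the sketch; a stricter value reduces false merges at the cost of leaving more variants untouched.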
Regular expressions
Regular Expressions
The re module is great, but tricky with different encodings.
Kodos [a]: visual debugging is indispensable!
    import re
    rawstr = r""">*\s*([A-Z]{2,3}\s*-\s*.[A-Z0-9]*)|(CF)|("CAPUT")\s+"""
    compile_obj = re.compile(rawstr, re.LOCALE)
[a] http://kodos.sourceforge.net/
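One way to sidestep the encoding issue is to decode the bytes first and match on text. In Python 3 the re.LOCALE flag is only accepted with bytes patterns, so the sketch below uses a plain str pattern. The pattern is a simplified variant of the one above, and both the function name find_references and the sample line are hypothetical.

```python
import re

# Simplified, hypothetical variant of the slide's pattern, for str input.
LEGIS_RE = re.compile(r'([A-Z]{2,3}\s*-\s*[A-Z0-9]+)|\b(CF)\b|("CAPUT")')


def find_references(raw_bytes, encoding="iso8859-1"):
    """Decode first, then match, so accented characters are handled as text."""
    text = raw_bytes.decode(encoding)
    return [m.group(0) for m in LEGIS_RE.finditer(text)]


# Hypothetical sample line, encoded like the scraped pages.
sample = 'Legislação: LEG-FED CF ANO-1988 ART-00102 "CAPUT"'.encode("iso8859-1")
print(find_references(sample))
```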
Database Interaction
Structuring the Data
Goals:
Reflect the original structure of the data
Store additional info coming from raw text
Design the data model with future analytical needs in mind
MySQLDb
Databases and Drivers
MySQL (MariaDb [4]) was the relational database of choice
MySQLDb's cursor.execute('select * from ...')
Server-side cursors were essential.
MongoDb + PyMongo
[4] http://mariadb.org
SQLAlchemy
What about ORMs?
Object-relational mappers are great, but...
SQLAlchemy [5] was used mostly for table creation and data insertion.
For analytical purposes, server-side raw SQL, stored procedures and views can't be beaten.
We mostly used Elixir to design the tables.
[5] http://www.sqlalchemy.org
MongoDb
Escaping from 2D data
Benefits:
Exploring MongoDb [a] as an alternative for analytics
Auto-sharding + Map/reduce!
Escape costly joins in MySQL
Tips:
db.cursor(cursorclass=SSDictCursor)
Convert every string to UTF-8
PyMongo's transparent conversion of dictionaries to BSON
[a] www.mongodb.org
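The string-conversion tip can be sketched as a small helper that decodes every byte string in a row before the row is handed to PyMongo. This is an assumption-laden illustration: the function name to_unicode is hypothetical, the sample row only mimics what a dictionary cursor would yield, and the source encoding is assumed to be ISO-8859-1.

```python
def to_unicode(row, encoding="iso8859-1"):
    """Return a copy of a row dict with every byte string decoded to str."""
    clean = {}
    for key, value in row.items():
        if isinstance(value, bytes):
            value = value.decode(encoding)
        clean[key] = value
    return clean


# Hypothetical row, shaped like what a dictionary cursor would yield.
row = {"processo": 1606809, "tipo": "Monocrática".encode("iso8859-1")}
doc = to_unicode(row)
print(doc)

# With PyMongo installed, the cleaned dict could then be stored as-is, e.g.
# collection.insert_one(doc), where `collection` is a hypothetical
# pymongo collection object; the driver encodes str values as UTF-8 in BSON.
```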
Natural Language Processing
Understanding Text
The biggest challenge is extracting meaning from decisions
Is a given decision for or against the defendant?
What is the vote count on non-unanimous decisions?
NLTK
Natural Language Toolkit
Lots of batteries included
Visualization
Visualizing the Data
You can't ask questions about what you don't know...
Data-driven research
Matplotlib
Standard Charting and Plotting: Matplotlib
Great for plotting summary statistics
Together with NetworkX, it can help visualize some small graphs
Ubigraph
Large Graph Visualization: Ubigraph
Ubigraph rocks! [a]
Navigating huge graphs gave powerful insights
Takes advantage of multiple cores and GPU
[a] http://ubietylab.net/ubigraph/
Gource
Untangling Temporal patterns:
A bit of Python to create logs compatible with Gource [6]
This:
    import time
    from matplotlib import cm
    from matplotlib.colors import Normalize, rgb2hex  # spelled normalize on the slide

    # dbdec (database connection) and ano (year) are defined elsewhere
    Q = dbdec.execute("SELECT relator, processo, tipo, proc_classe, duracao, UF, data_dec "
                      "FROM decisao WHERE DATE_FORMAT(data_dec, '%Y%')=" + "%s" % ano +
                      " ORDER BY data_dec asc")
    decs = Q.fetchall()
    durations = [d[4] for d in decs]
    cmap = cm.jet
    norm = Normalize(min(durations), max(durations))  # normalizing durations
    with open('decisoes_%s.log' % ano, 'w') as f:
        for d in decs:
            c = rgb2hex(cmap(norm(d[4]))[:3]).strip('#')  # duration mapped to a hex color
            path = "/%s/%s/%s/%s" % (d[5], d[2], d[3], d[1])  # /State/tipo/proc_classe/processo
            l = "%s|%s|%s|%s|%s\n" % (int(time.mktime(d[6].timetuple())), d[0], 'A', path, c)
            f.write(l)
Generates this:
    885967200|MIN. SYDNEY SANCHES |A|/MG/Monocrática/INQUÉRITO/1606809|0000a4
    885967200|MIN. SYDNEY SANCHES |A|/MG/Presidência/INQUÉRITO/1606809|0000a4
[6] http://code.google.com/p/gource/
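A quick sanity check of the generated file can be sketched as below. It assumes Gource's custom log layout of timestamp|user|action|path|colour with actions A, M or D, and that entries are expected in chronological order. The helper names parse_gource_line and is_chronological are hypothetical and not part of the original pipeline.

```python
def parse_gource_line(line):
    """Split one custom-log line into (timestamp, user, action, path, color)."""
    timestamp, user, action, path, color = line.rstrip("\n").split("|")
    if action not in ("A", "M", "D"):
        raise ValueError("unknown action: %r" % action)
    int(color, 16)  # raises ValueError if the colour is not hexadecimal
    return int(timestamp), user.strip(), action, path, color


def is_chronological(lines):
    """True if timestamps never decrease from one line to the next."""
    stamps = [parse_gource_line(line)[0] for line in lines]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))


# The two sample lines shown on the slide.
log = [
    "885967200|MIN. SYDNEY SANCHES |A|/MG/Monocrática/INQUÉRITO/1606809|0000a4\n",
    "885967200|MIN. SYDNEY SANCHES |A|/MG/Presidência/INQUÉRITO/1606809|0000a4\n",
]
print(parse_gource_line(log[0]))
print(is_chronological(log))
```

The check relies on the query's ORDER BY data_dec clause having done its job; it only reports a problem, it does not re-sort the file.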
A snapshot of the Supreme Court activities: 1998
The Dynamics
Video
Visual Python
It’s a Jungle Out There. . .
Division of labor in the Supreme Court
VPython [a] is great for quickly creating complex animations.
Here judges are trees, branches are subjects and leaves are legal decisions
[a] vpython.org
Results
Detailed X-ray of the inner workings of the Supreme Court
92% of the cases are appeals of a non-constitutional nature
These results led to the proposal of an amendment to the constitution!
More questions than answers!
Python for data mining rocks!
Future Directions
To be continued...
Further automate and optimize
More explorations
Scale up the pipeline
Model the life history of a legal process
Acknowledgements
FGV - Direito Rio
FGV - EMAp
Brazilian Supreme Court
Asla Sá (for kindly lending us her server)