Neo4j Partner Tag Berlin - Investigating the Panama Papers Connections with Neo4j

Investigating the #PanamaPapers Connections with Neo4j. Partner Day Berlin. Stefan Kolmar, Director Field Engineering


Post on 11-Apr-2017


TRANSCRIPT


Source Material

Taken from:
•  the ICIJ presentation
•  the Reddit AMA
•  online publications (SZ, Guardian, TNW et al.)
•  the ICIJ website
    •  https://panamapapers.icij.org/
    •  The Power Players
    •  Key Numbers & Figures

+190 journalists in more than 65 countries

12 staff members (USA, Costa Rica, Venezuela, Germany, France, Spain)

50% of the team = Data & Research Unit

[Diagram: processing pipeline with the stages raw files, raw text, metadata (author; sender...), database, search and discovery]

3 million files x 10 seconds per file = 347 days
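The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the file count and per-file time come from the slide, while the `parallel_days` helper and its assumption of perfect scaling across workers are illustrative additions, not something the deck describes.

```python
# Back-of-the-envelope check of the processing-time estimate.
FILES = 3_000_000            # "3 million files" from the slide
SECONDS_PER_FILE = 10        # "10 seconds per file" from the slide
SECONDS_PER_DAY = 24 * 60 * 60


def sequential_days(files: int, seconds_per_file: float) -> float:
    """Days needed when files are processed strictly one after another."""
    return files * seconds_per_file / SECONDS_PER_DAY


def parallel_days(files: int, seconds_per_file: float, workers: int) -> float:
    """Hypothetical helper: assumes the work splits perfectly across workers."""
    return sequential_days(files, seconds_per_file) / workers


days = sequential_days(FILES, SECONDS_PER_FILE)
print(f"{days:.1f} days")  # 347.2 days
```

The point of the slide is that a single sequential pass is impractical, which is why the extraction work has to be distributed.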

Investigators used Nuix's optical character recognition to make millions of scanned documents text-searchable. They used Nuix's named entity extraction and other analytical tools to identify and cross-reference the names of Mossack Fonseca clients through millions of documents.

Lucene syntax queries with proximity matching!
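As a rough illustration of what such a query looks like, Lucene's query syntax writes a proximity search as a quoted phrase followed by `~` and a maximum distance. The helper below is a minimal sketch that only builds the query string; the function name, the example terms and the optional field name are assumptions made for illustration, not taken from the ICIJ setup.

```python
from typing import Optional, Sequence


def proximity_query(terms: Sequence[str], max_distance: int,
                    field: Optional[str] = None) -> str:
    """Build a Lucene proximity query such as "mossack fonseca"~5."""
    if max_distance < 0:
        raise ValueError("max_distance must not be negative")
    if any('"' in term for term in terms):
        # Keep the sketch simple: no escaping, so reject embedded quotes.
        raise ValueError("terms must not contain double quotes")
    phrase = " ".join(terms)
    query = f'"{phrase}"~{max_distance}'
    # An optional field prefix restricts the search to one indexed field.
    return f"{field}:{query}" if field else query


print(proximity_query(["mossack", "fonseca"], 5))  # "mossack fonseca"~5
```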

400 users

Stack

Unstructured data extraction
●  Nuix professional OCR service
●  ICIJ Extract (open source, Java: https://github.com/ICIJ/extract), leverages Apache Tika, Tesseract OCR and JBIG2-ImageIO.

Structured data extraction
●  A bunch of Python

Database
●  Apache Solr (open source, Java)
●  Redis (open source, C)
●  Neo4j (open source, Java)

App
●  Blacklight (open source, Rails)
●  Linkurious (closed source, JS)

Context is King

[Diagram: five unconnected records]
•  PERSON name: "John", last: "Miller", role: "Negotiator"
•  PERSON name: "Maria", last: "Osara"
•  PERSON name: "Jose", last: "Pereia", position: "Governor"
•  PERSON name: "Alice", last: "Smith", role: "Advisor"
•  name: "Some Media Ltd", value: "$70M"

Context is King

[Diagram: the same five records, now connected by SENT, SUPPORTS, CREATED, MENTIONS and WROTE relationships, with the additional property since: Jan 10, 2011]

The world is a graph – everything is connected

•  people, places, events
•  companies, markets
•  countries, history, politics
•  sciences, art, teaching
•  technology, networks, machines, applications, users
•  software, code, dependencies, architecture, deployments
•  criminals, fraudsters and their behavior

Property Graph Model

Nodes
•  The entities in the graph
•  Can have name-value properties
•  Can be labeled

Relationships
•  Relate nodes by type and direction
•  Can have name-value properties

[Diagram: two NODEs connected by a RELATIONSHIP, each carrying key: "value" properties]
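The node and relationship definitions above map naturally onto two small data classes. The following is a minimal sketch in plain Python, not Neo4j's internal representation; the class names, the example people and the `since` value are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity in the graph: optional labels plus name-value properties."""
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """Relates two nodes by type and direction (start -> end)."""
    start: Node
    end: Node
    type: str
    properties: dict = field(default_factory=dict)


# Illustrative data only: two labeled nodes and one typed, directed relationship.
dan = Node({"Person"}, {"name": "Dan"})
ann = Node({"Person"}, {"name": "Ann"})
knows = Relationship(dan, ann, "KNOWS", {"since": 2011})  # "since" is made up

print(knows.start.properties["name"], knows.type, knows.end.properties["name"])
# Dan KNOWS Ann
```

Both nodes and relationships carry their own property maps, which is what distinguishes the property graph model from a plain edge list.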

Your friend Neo4j

An open-source graph database
•  Manage and store your connected data as a graph
•  Query relationships easily and quickly
•  Evolve model and applications to support new requirements and insights
•  Built to solve relational pains

Value from Data Relationships: Common Use Cases

Internal Applications
•  Master Data Management
•  Network and IT Operations
•  Fraud Detection

Customer-Facing Applications
•  Real-Time Recommendations
•  Graph-Based Search
•  Identity and Access Management

http://neo4j.com/use-cases

Whiteboard to Graph

Neo4j: All about Patterns

(:Person {name: "Dan"})-[:KNOWS]->(:Person {name: "Ann"})

[Diagram: node Dan connected to node Ann by a KNOWS relationship; annotations mark NODE, LABEL and PROPERTY in the pattern]

http://neo4j.com/developer/cypher

Cypher: Find Patterns

MATCH (:Person {name: "Dan"})-[:KNOWS]->(who:Person) RETURN who

[Diagram: node Dan connected by a KNOWS relationship to an unknown node "???"; annotations mark NODE, LABEL, PROPERTY and ALIAS in the query]

http://neo4j.com/developer/cypher
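To make the semantics of the pattern concrete, here is a sketch that evaluates the same question ("whom does Dan know?") over a tiny in-memory graph in plain Python. It only illustrates what the pattern constrains (labels, a property, relationship type and direction); it says nothing about how Neo4j executes the query. The third person, Bob, is a made-up addition to show that direction matters.

```python
# Nodes: id -> (labels, properties). Relationships: (start_id, type, end_id).
nodes = {
    1: ({"Person"}, {"name": "Dan"}),
    2: ({"Person"}, {"name": "Ann"}),
    3: ({"Person"}, {"name": "Bob"}),   # hypothetical extra person
}
relationships = [
    (1, "KNOWS", 2),   # Dan knows Ann
    (3, "KNOWS", 1),   # Bob knows Dan (wrong direction for the pattern)
]


def known_by(name: str) -> list:
    """Mirror of: MATCH (:Person {name: $name})-[:KNOWS]->(who:Person) RETURN who."""
    result = []
    for start, rel_type, end in relationships:
        start_labels, start_props = nodes[start]
        end_labels, end_props = nodes[end]
        if (rel_type == "KNOWS"
                and "Person" in start_labels
                and start_props.get("name") == name
                and "Person" in end_labels):
            result.append(end_props)
    return result


print(known_by("Dan"))  # [{'name': 'Ann'}]
```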

Getting Data into Neo4j

Cypher-Based "LOAD CSV"
•  Transactional (ACID) writes
•  Initial and incremental loads of up to 10 million nodes and relationships

LOAD CSV WITH HEADERS FROM "url" AS row
MERGE (:Person {name: row.name, age: toInt(row.age)});
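The key property of the statement above is that MERGE creates a node only if no node with the same label and property values exists yet, so repeated rows do not produce duplicates. The sketch below imitates that behaviour in plain Python with the standard `csv` module. It is a simplified model under stated assumptions (the sample rows and ages are invented, and a dictionary stands in for the database), not a description of how Neo4j implements MERGE.

```python
import csv
import io

# Invented sample data; note the repeated "Dan" row.
CSV_TEXT = """name,age
Dan,42
Ann,39
Dan,42
"""


def load_people(csv_text: str) -> list:
    """Read rows with headers and keep one Person per (name, age) pair."""
    people = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Like toInt(row.age): CSV values arrive as strings and need converting.
        key = (row["name"], int(row["age"]))
        # Like MERGE: only create the entry when no identical one exists.
        people.setdefault(key, {"label": "Person", "name": key[0], "age": key[1]})
    return list(people.values())


for person in load_people(CSV_TEXT):
    print(person)
```

Running the load twice over the same rows would still leave two people, which is what makes MERGE suitable for incremental loads.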

Getting Data into Neo4j

Load JSON with Cypher
•  Load JSON via procedure
•  Deconstruct the document
•  Into a non-duplicated graph model

CALL apoc.load.json("url") YIELD value AS doc
UNWIND doc.items AS item
MERGE (:Contract {title: item.title, amount: toFloat(item.amount)});
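"Deconstructing the document into a non-duplicated graph model" means turning one nested document into separate nodes and relationships, with each entity stored once. The sketch below shows the idea in plain Python. The `signed_by` field, the sample values and the choice to key contracts by title alone are assumptions for illustration; the statement on the slide only reads `title` and `amount`.

```python
import json

# Invented sample document; "signed_by" is a hypothetical field.
DOC = """
{"items": [
  {"title": "Contract A", "amount": "1000.50", "signed_by": "Dan"},
  {"title": "Contract B", "amount": "250", "signed_by": "Dan"}
]}
"""


def deconstruct(json_text: str):
    """Split a JSON document into Contract nodes, Person nodes and SIGNED links."""
    doc = json.loads(json_text)
    contracts, persons, signed = {}, {}, set()
    for item in doc["items"]:                      # like UNWIND doc.items AS item
        title = item["title"]
        contracts.setdefault(
            title, {"title": title, "amount": float(item["amount"])})
        signer = item["signed_by"]
        persons.setdefault(signer, {"name": signer})   # one node per person
        signed.add((signer, "SIGNED", title))
    return contracts, persons, signed


contracts, persons, signed = deconstruct(DOC)
print(len(contracts), len(persons), len(signed))  # 2 1 2
```

Although Dan appears in two items, he ends up as a single node with two relationships, which is the de-duplication the slide refers to.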

Getting Data into Neo4j

CSV Bulk Loader neo4j-import
•  For initial database population
•  For loads with 10B+ records
•  Up to 1M records per second

bin/neo4j-import --into people.db \
    --nodes:Person people.csv \
    --nodes:Company companies.csv \
    --relationships:STAKEHOLDER stakeholders.csv
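The bulk loader reads plain CSV files whose header row tells it which column is the node ID and which columns are the relationship endpoints. The sketch below generates three small files of that shape in memory. The header conventions used here (`:ID(...)`, `:START_ID(...)`, `:END_ID(...)` with ID spaces) are written as I recall them from the import tool's documentation and should be checked against the Neo4j version in use; all rows and column names are invented.

```python
import csv
import io


def to_csv(header: list, rows: list) -> str:
    """Serialize a header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Node files: one ID column per file, scoped to its own ID space.
people_csv = to_csv(["personId:ID(Person)", "name"],
                    [["p1", "Dan"], ["p2", "Ann"]])
companies_csv = to_csv(["companyId:ID(Company)", "name"],
                       [["c1", "Some Media Ltd"]])

# Relationship file: start and end IDs refer back to the node files.
stakeholders_csv = to_csv([":START_ID(Person)", ":END_ID(Company)"],
                          [["p1", "c1"], ["p2", "c1"]])

print(people_csv)
```

In practice the three strings would be written to people.csv, companies.csv and stakeholders.csv before running the import command.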

The Steps Involved in the Document Analysis

1.  Acquire documents
2.  Classify documents
    •  Scan / OCR
    •  Extract document metadata
3.  Whiteboard domain and questions, determine
    •  entities and their relationships
    •  potential entity and relationship properties
    •  sources for those entities and their properties

The Steps Involved in the Document Analysis

4.  Develop analyzers, rules, parsers and named entity recognition
5.  Parse and store metadata, document and entity relationships
    •  Parse by author, named entities, dates, sources and classifications
6.  Infer entity relationships
7.  Compute similarities, transitive cover and triangles
8.  Analyze data using graph queries and visualizations
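Step 7 in the list above can be illustrated on a toy graph. The sketch below counts triangles and computes a neighbourhood-based Jaccard similarity in plain Python. The entity names A to D and their connections are invented, and Jaccard is just one possible similarity measure chosen for the example; the deck does not say which measures were actually used.

```python
from itertools import combinations

# Invented undirected connections between four entities.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours (undirected)."""
    adjacency = {}
    for a, b in edge_list:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def triangles(edge_list):
    """Return every set of three mutually connected nodes."""
    adjacency = build_adjacency(edge_list)
    found = set()
    for node, neighbours in adjacency.items():
        for x, y in combinations(sorted(neighbours), 2):
            if y in adjacency[x]:
                found.add(frozenset((node, x, y)))
    return found


def jaccard(adjacency, a, b):
    """Shared neighbours divided by all neighbours of the two nodes."""
    union = adjacency[a] | adjacency[b]
    return len(adjacency[a] & adjacency[b]) / len(union) if union else 0.0


adjacency = build_adjacency(edges)
print(len(triangles(edges)))            # 1
print(round(jaccard(adjacency, "A", "B"), 3))  # 0.333
```

Triangles point to tightly connected groups, while a high neighbourhood similarity suggests two entities that may be related even without a direct link.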

We need a Data Model

Meta Data Entities
•  Document, Email, Contract, DB-Record
•  Meta: Author, Date, Source, Keywords
•  Conversation: Sender, Receiver, Topic
•  Money Flows

Actual Entities
•  Person
•  Representative (Officer)
•  Address
•  Client
•  Company
•  Account

Either based on our use cases & questions, or on the entities present in our meta-data and data.

Data Model – Relationships

Meta-Data
•  sent, received, cc'ed
•  mentioned, topic-of
•  created, signed
•  attached
•  roles
•  family relationships

Activities
•  open account
•  manage
•  has shares
•  registered address
•  money flow

The ICIJ Data Model

•  Simplistic data model with 4 entities and 5 relationships
•  We only know the published model
•  Missing
    •  Documents, Metadata
    •  Family Relationships
    •  Connections to Public Record Databases
•  Contains duplicates
•  Relationship information stored on entities
•  Could use richer labeling

Example Dataset - Azerbaijan's President Ilham Aliyev

•  was already previously investigated
•  whole family involved
•  different shell companies & involvements

http://neo4j.com/graphgist/ec65c2fa-9d83-4894-bc1e-98c475c7b57a

Based on: http://neo4j.com/blog/analyzing-panama-papers-neo4j/