quantifying rdf data sets

Quantifying RDF data sets (a start), Janos G. Hajagos, Stony Brook University School of Medicine

Upload: janos-hajagos

Post on 11-May-2015

DESCRIPTION

The Semantic Web is built on the Resource Description Framework (RDF). RDF is a graph model, so one would expect that a wide range of network analysis tools could be applied directly to an RDF data set. However, most network algorithms assume that a graph has no parallel edges, which the RDF graph model allows. Two approaches are examined: direct measures of RDF graph structure using ratios, and extraction of graphs from an RDF data set. Py-Triple-Simple (http://code.google.com/p/py-triple-simple/), an experimental pure Python library, can extract “well behaved” graphs from an N-Triples file and can quantify RDF graph structure using ratios.

TRANSCRIPT

Page 1: Quantifying RDF data sets

Quantifying RDF data sets (a start)

Janos G. Hajagos
Stony Brook University
School of Medicine

Page 2: Quantifying RDF data sets

Resource Description Framework

Graph based data model:
– Vertices or nodes are identified by URIs
  <http://dbpedia.org/resource/Aspirin>
– Vertices can be typed: rdf:type
– Directed edges or links are specified with URIs
– Parallel edges are allowed (multi-graph)
– Literals are properties of vertices
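The data model above can be illustrated with a minimal Python sketch. Everything other than the Aspirin URI and rdf:type is an assumption made for illustration: the `example.org` vocabulary, the sample triples, and the helper names are all hypothetical. The sketch separates literal-valued triples (vertex properties) from URI-valued triples (edges) and shows how two predicates between the same pair of vertices yield parallel edges.

```python
from collections import defaultdict

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
ASPIRIN = "<http://dbpedia.org/resource/Aspirin>"

# Hypothetical triples: the example.org predicates and objects are invented.
triples = [
    (ASPIRIN, RDF_TYPE, "<http://example.org/vocab/Drug>"),
    (ASPIRIN, "<http://example.org/vocab/treats>", "<http://example.org/resource/Headache>"),
    (ASPIRIN, "<http://example.org/vocab/studiedFor>", "<http://example.org/resource/Headache>"),
    (ASPIRIN, "<http://www.w3.org/2000/01/rdf-schema#label>", '"Aspirin"@en'),
]

def is_literal(term):
    """In N-Triples syntax a literal starts with a double quote."""
    return term.startswith('"')

edges = defaultdict(list)       # (subject, object) -> predicates linking them
properties = defaultdict(list)  # subject -> (predicate, literal) pairs

for s, p, o in triples:
    if is_literal(o):
        properties[s].append((p, o))   # literals are properties of a vertex
    else:
        edges[(s, o)].append(p)        # URI objects give directed edges

# Vertex pairs joined by more than one predicate are parallel edges.
parallel = {pair: preds for pair, preds in edges.items() if len(preds) > 1}
print(parallel)
```

Most network algorithms would need the two Aspirin-to-Headache edges merged into one before they could be applied.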

Page 3: Quantifying RDF data sets

http://challenge.semanticweb.org/submissions/swc2010_submission_15.pdf

Page 4: Quantifying RDF data sets

• Pure Python library
• In-memory only
• PyPy JIT for speed
• API for pattern matching

• No SPARQL support
• Ignores types
• No named graphs
• No http access
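To make "pattern matching" over an in-memory triple store concrete, here is a small sketch. It is not Py-Triple-Simple's actual API: the `match` function, its parameters, and the sample data are hypothetical and only illustrate the general idea of matching a triple pattern where `None` acts as a wildcard.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Hypothetical in-memory store: a plain list of (subject, predicate, object).
triples = [
    ("<http://example.org/a>", RDF_TYPE, "<http://example.org/Drug>"),
    ("<http://example.org/b>", RDF_TYPE, "<http://example.org/Drug>"),
    ("<http://example.org/a>", "<http://example.org/interactsWith>", "<http://example.org/b>"),
]

def match(store, s=None, p=None, o=None):
    """Yield every triple that agrees with the pattern; None matches anything."""
    for subj, pred, obj in store:
        if s is not None and subj != s:
            continue
        if p is not None and pred != p:
            continue
        if o is not None and obj != o:
            continue
        yield (subj, pred, obj)

typed = list(match(triples, p=RDF_TYPE))
about_a = list(match(triples, s="<http://example.org/a>"))
print(len(typed), len(about_a))
```

A linear scan like this is only practical for small data; an in-memory library would more plausibly keep indexes keyed on subject, predicate, and object.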

Page 5: Quantifying RDF data sets

Counting: 1, 2, 3, . . .

• Number of triples (Nt)
• Number of literals (Nl)
• Number of object URIs (No)
• Number of distinct literals (type removed) (Ndl)
• Number of distinct objects (Ndo)
• Number of distinct subjects (Nds)
• Number of distinct URIs (Nu)
• Number of typed instances (Ni)
• Number of instances of type t (Nit)
• Number of distinct classes (Nc)
• Number of distinct predicates (Ndp)
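The counts listed above can be gathered in a single pass over the triples. The sketch below is one possible reading of the definitions, not the library's implementation. It assumes that literals are recognized by a leading double quote, that "type removed" means dropping a trailing datatype or language tag, that Ni counts rdf:type triples, and that Nu counts distinct URIs in subject or object position. The sample triples are hypothetical.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def is_literal(term):
    return term.startswith('"')

def strip_type(literal):
    """Drop a trailing ^^<datatype> or @lang tag, keeping the quoted lexical form."""
    return literal[: literal.rfind('"') + 1]

def count_triples(triples):
    subjects, predicates, object_uris, literals = set(), set(), set(), set()
    per_class = Counter()  # Nit: number of instances for each class t
    nt = nl = no = ni = 0
    for s, p, o in triples:
        nt += 1
        subjects.add(s)
        predicates.add(p)
        if is_literal(o):
            nl += 1
            literals.add(strip_type(o))
        else:
            no += 1
            object_uris.add(o)
            if p == RDF_TYPE:
                ni += 1
                per_class[o] += 1
    return {
        "Nt": nt, "Nl": nl, "No": no, "Ni": ni,
        "Ndl": len(literals), "Ndo": len(object_uris), "Nds": len(subjects),
        "Nu": len(subjects | object_uris),  # assumption: predicates excluded
        "Nit": dict(per_class), "Nc": len(per_class), "Ndp": len(predicates),
    }

# Hypothetical sample data.
triples = [
    ("<ex:a>", RDF_TYPE, "<ex:Drug>"),
    ("<ex:b>", RDF_TYPE, "<ex:Drug>"),
    ("<ex:a>", "<ex:interactsWith>", "<ex:b>"),
    ("<ex:a>", "<ex:label>", '"A"@en'),
    ("<ex:b>", "<ex:label>", '"A"'),
    ("<ex:b>", "<ex:note>", '"B"^^<ex:string>'),
]
counts = count_triples(triples)
print(counts)
```

Here the two labels `"A"@en` and `"A"` count as one distinct literal because the language tag is removed before comparison.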

Page 6: Quantifying RDF data sets

Simple fractions

“Literalness” = Nl / Nt
“Literal uniqueness” = Ndl / Nl
“Object uniqueness” = Ndo / No
“Structure” = 1 - (Ni + Nl) / Nt
“Subject coverage” = Nds / Nu
“Object coverage” = Ndo / Nu
“Type frequency of class t” = {Nit / Ni , . . .}
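These fractions follow directly from the counts. A minimal sketch, assuming the counts arrive in a dictionary keyed by the symbols used above; the dictionary layout, the function name, and the toy numbers are all hypothetical:

```python
def simple_fractions(c):
    """Compute the ratios from a dict of counts keyed by the slide's symbols."""
    return {
        "literalness": c["Nl"] / c["Nt"],
        "literal_uniqueness": c["Ndl"] / c["Nl"],
        "object_uniqueness": c["Ndo"] / c["No"],
        "structure": 1 - (c["Ni"] + c["Nl"]) / c["Nt"],
        "subject_coverage": c["Nds"] / c["Nu"],
        "object_coverage": c["Ndo"] / c["Nu"],
        # One fraction per class t, largest first.
        "type_frequency": sorted((n / c["Ni"] for n in c["Nit"].values()), reverse=True),
    }

# Toy counts for illustration only; they do not describe any real data set.
toy_counts = {
    "Nt": 100, "Nl": 40, "No": 60, "Ni": 20,
    "Ndl": 10, "Ndo": 15, "Nds": 18, "Nu": 25,
    "Nit": {"<ex:Drug>": 15, "<ex:Target>": 5},
}
fractions = simple_fractions(toy_counts)
for name, value in fractions.items():
    print(name, value)
```

With these toy counts, 40 of 100 triples carry literals and 20 are type assertions, so "structure" measures the remaining 40% of triples that link one resource to another.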

Page 8: Quantifying RDF data sets

Linked CT

Statistics:
Number of triples (Nt): 27,965,909
Number of literals (Nl): 11,153,086
Number of objects (No): 16,812,823
Number of typed instances (Ni): 3,033,501
Number of URIs excluding predicates (Nu): 3,269,681
Number of distinct classes (Nc): 30
Number of distinct subjects (Nds): 3,033,495
Number of distinct predicates (Ndp): 123
Number of distinct objects (Ndo): 3,148,210
Number of distinct literals (Ndl): 5,496,593
Number of distinct lexical symbols (Ndls): 8,621,986

Literalness (Nl/Nt): 0.399
Literal uniqueness (Ndl/Nl): 0.493
Object uniqueness (Ndo/No): 0.187
Structure (1 - (Nl+Ni)/Nt): 0.492
Subject coverage (Nds/Nu): 0.927
Object coverage (Ndo/Nu): 0.962
Class coverage: [0.15, 0.13, 0.12, 0.08, 0.05, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.03, 0.03, 0.03, 0.03, 0.02, 0.01, 0.01, 0.009, 0.008, 0.007, 0.007, 0.006, 0.002, 0.002, 0.001, 6.0e-05, 4.0e-05, 9.2e-06, 6.6e-07]

Top 5 subjects:
<http://data.linkedct.org/resource/country/united-states>, 60,980
<http://data.linkedct.org/resource/state/california>, 15,775
<http://data.linkedct.org/resource/state/texas>, 13,264
<http://data.linkedct.org/resource/state/new-york>, 13,172
<http://data.linkedct.org/resource/oversight_info/7eb3d38adc47e7e583ab6031fe2948ba>, 11,963

Top 5 objects including literals:
"No", 525,210
<http://data.linkedct.org/vocab/resource/location>, 477,926
<http://data.linkedct.org/vocab/resource/facility>, 387,542
<http://data.linkedct.org/vocab/resource/outcome>, 376,231
<http://data.linkedct.org/vocab/resource/external_linkage>, 271,431
<http://data.linkedct.org/resource/linkage_method/standardized-string-matching>, 185,902

Top 5 predicates:
<http://data.linkedct.org/vocab/resource/has_provenance>, 7,482,352
<http://www.w3.org/2000/01/rdf-schema#label>, 3,142,207
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 3,033,501
<http://data.linkedct.org/vocab/resource/trial_location>, 982,202
<http://data.linkedct.org/vocab/resource/location_facility>, 477,923

Page 9: Quantifying RDF data sets

BioGrid in BioPax

Statistics:
Number of triples (Nt): 14,326,621
Number of literals (Nl): 5,680,921
Number of objects (No): 8,645,700
Number of typed instances (Ni): 4,229,345
Number of URIs excluding predicates (Nu): 4,229,358
Number of distinct classes (Nc): 12
Number of distinct subjects (Nds): 4,229,345
Number of distinct predicates (Ndp): 23
Number of distinct objects (Ndo): 4,009,607
Number of distinct literals (Ndl): 1,145,973
Number of distinct lexical symbols (Ndls): 5,375,354

Literalness (Nl/Nt): 0.400
Literal uniqueness (Ndl/Nl): 0.202
Object uniqueness (Ndo/No): 0.464
Structure (1 - (Nl+Ni)/Nt): 0.309
Subject coverage (Nds/Nu): 0.999
Object coverage (Ndo/Nu): 0.948
Class coverage: [0.295, 0.156, 0.104, 0.104, 0.104, 0.066, 0.052, 0.052, 0.052, 0.007, 0.007, 2.3e-07]

Top 5 subjects:
<http://cbio.mskcc.org/cpath#CPATH-716194>, 470
<http://cbio.mskcc.org/cpath#CPATH-156001>, 362
<http://cbio.mskcc.org/cpath#CPATH-738240>, 292
<http://cbio.mskcc.org/cpath#CPATH-818091>, 266
<http://cbio.mskcc.org/cpath#CPATH-726044>, 229

Top 5 objects including literals:
<http://www.biopax.org/release/biopax-level2.owl#unificationXref>, 1,249,232
<http://www.biopax.org/release/biopax-level2.owl#openControlledVocabulary>, 659,251
"PSI-MI", 659,250
"PUBMED", 439,528
<http://www.biopax.org/release/biopax-level2.owl#publicationXref>, 439,528

Top 5 predicates:
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 4,229,345
<http://www.biopax.org/release/biopax-level2.owl#DB>, 1,966,356
<http://www.biopax.org/release/biopax-level2.owl#ID>, 1,966,356
<http://www.biopax.org/release/biopax-level2.owl#XREF>, 1,933,616
<http://www.biopax.org/release/biopax-level2.owl#TERM>, 659,251

Page 10: Quantifying RDF data sets

RxNorm

Statistics:
Number of triples (Nt): 9,169,907
Number of literals (Nl): 4,557,110
Number of objects (No): 4,612,797
Number of typed instances (Ni): 628,852
Number of URIs excluding predicates (Nu): 808,979
Number of distinct classes (Nc): 6
Number of distinct subjects (Nds): 807,722
Number of distinct predicates (Ndp): 193
Number of distinct objects (Ndo): 471,847
Number of distinct literals (Ndl): 2,577,006
Number of distinct lexical symbols (Ndls): 3,385,997

Literalness (Nl/Nt): 0.497
Literal uniqueness (Ndl/Nl): 0.565
Object uniqueness (Ndo/No): 0.102
Structure (1 - (Nl+Ni)/Nt): 0.434
Subject coverage (Nds/Nu): 0.998
Object coverage (Ndo/Nu): 0.583
Class coverage: [0.748, 0.252, 0.0003, 5.6e-05, 9.5e-06, 6.360e-06]

Top 5 subjects:
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI/317541>, 11,804
<http://link.informatics.stonybrook.edu/rxnorm/RXAUI/3149147>, 9,943
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316949>, 8,668
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316968>, 6,464
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI/316965>, 4,605

Top 5 objects including literals:
<http://link.informatics.stonybrook.edu/rxnorm/RXAUI>, 470,170
<http://link.informatics.stonybrook.edu/rxnorm/RXCUI>, 158,457
<http://link.informatics.stonybrook.edu/rxnorm/SAB/RXNORM>, 143,622
<http://link.informatics.stonybrook.edu/rxnorm/SAB/NDFRT>, 134,049
<http://link.informatics.stonybrook.edu/rxnorm/TTY/CD>, 101,246

Top 5 predicates:
<http://www.w3.org/2000/01/rdf-schema#label>, 807,705
<http://link.informatics.stonybrook.edu/rxnorm/ATN#NDC>, 634,124
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 628,852
<http://link.informatics.stonybrook.edu/rxnorm/REL#has_related_form>, 571,320
<http://link.informatics.stonybrook.edu/umls/hasCUI>, 507,950

Page 11: Quantifying RDF data sets

SUNY Reach in VIVO

Statistics:
Number of triples (Nt): 1,278,216
Number of literals (Nl): 562,262
Number of objects (No): 715,954
Number of typed instances (Ni): 243,263
Number of URIs excluding predicates (Nu): 174,488
Number of distinct classes (Nc): 71
Number of distinct subjects (Nds): 161,459
Number of distinct predicates (Ndp): 109
Number of distinct objects (Ndo): 172,991
Number of distinct literals (Ndl): 224,290
Number of distinct lexical symbols (Ndls): 398,887

Literalness (Nl/Nt): 0.440
Literal uniqueness (Ndl/Nl): 0.399
Object uniqueness (Ndo/No): 0.241
Structure (1 - (Nl+Ni)/Nt): 0.369
Subject coverage (Nds/Nu): 0.925
Object coverage (Ndo/Nu): 0.991
Class coverage: [0.391, 0.132, 0.128, 0.083, 0.075, 0.040, 0.037, 0.017, . . .]

Top 5 subjects:
<http://reach.suny.edu/individual/team_1>, 599
<http://reach.suny.edu/individual/Faraone_Stephen>, 404
<http://reach.suny.edu/individual/Hopkins_L>, 298
<http://reach.suny.edu/individual/Genco_Robert>, 272
<http://reach.suny.edu/individual/Jusko_William>, 257

Top 5 objects including literals:
<http://vivoweb.org/ontology/core#Authorship>, 95,303
<http://xmlns.com/foaf/0.1/Person>, 32,040
<http://reach.suny.edu/ontology/core#Other_Investigator>, 31,170
<http://vivoweb.org/ontology/core#Relationship>, 20,176
<http://vivoweb.org/ontology/core#InformationResource>, 18,301

Top 5 predicates:
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 243,263
<http://vivoweb.org/ontology/core#freetextKeyword>, 199,327
<http://www.w3.org/2000/01/rdf-schema#label>, 144,653
<http://vivoweb.org/ontology/core#informationResourceInAuthorship>, 95,105
<http://vivoweb.org/ontology/core#authorInAuthorship>, 95,101

Page 12: Quantifying RDF data sets

DrugBank

Statistics:
Number of triples (Nt): 766,920
Number of literals (Nl): 494,028
Number of objects (No): 272,892
Number of typed instances (Ni): 24,522
Number of URIs excluding predicates (Nu): 103,847
Number of distinct classes (Nc): 8
Number of distinct subjects (Nds): 19,693
Number of distinct predicates (Ndp): 119
Number of distinct objects (Ndo): 89,685
Number of distinct literals (Ndl): 186,457
Number of distinct lexical symbols (Ndls): 290,307

Literalness (Nl/Nt): 0.644
Literal uniqueness (Ndl/Nl): 0.377
Object uniqueness (Ndo/No): 0.329
Structure (1 - (Nl+Ni)/Nt): 0.324
Subject coverage (Nds/Nu): 0.190
Object coverage (Ndo/Nu): 0.863
Class coverage: [0.41, 0.20, 0.20, 0.19, 0.004, 0.004, 0.002, 0.0002]

Top 5 subjects:
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/587>, 3,767
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/3722>, 3,032
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/357>, 2,780
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/146>, 2,570
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/136>, 2,504

Top 5 objects including literals:
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/drug_interactions>, 10,153
"physiological process", 8,001
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/references/17016423>, 7,191
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/references/17139284>, 7,191
"catalytic activity", 6,841

Top 5 predicates:
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/generalReference>, 72,359
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/goClassificationFunction>, 72,232
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/goClassificationProcess>, 63,520
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/synonym>, 44,949
<http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/cellularLocation>, 26,258

Page 13: Quantifying RDF data sets

DailyMed

Statistics:
Number of triples (Nt): 164,276
Number of literals (Nl): 59,885
Number of objects (No): 104,391
Number of typed instances (Ni): 14,934
Number of URIs excluding predicates (Nu): 22,365
Number of distinct classes (Nc): 6
Number of distinct subjects (Nds): 10,015
Number of distinct predicates (Ndp): 28
Number of distinct objects (Ndo): 21,968
Number of distinct literals (Ndl): 45,814
Number of distinct lexical symbols (Ndls): 68,181

Literalness (Nl/Nt): 0.364
Literal uniqueness (Ndl/Nl): 0.765
Object uniqueness (Ndo/No): 0.210
Structure (1 - (Nl+Ni)/Nt): 0.544
Subject coverage (Nds/Nu): 0.448
Object coverage (Ndo/Nu): 0.982
Class coverage: [0.37, 0.29, 0.29, 0.05, 0.002, 0.0003]

Top 5 subjects:
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/2245>, 240
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/organization/Hospira,_Inc.>, 216
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/2019>, 200
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/organization/Teva_Pharmaceuticals_USA, 193
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/drugs/3505>, 170

Top 5 objects including literals:
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/ingredients>, 5,577
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/drugs>, 4,308
<http://www4.wiwiss.fu-berlin.de/drugbank/vocab/resource/class/Offer>, 4,308
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/routeOfAdministration/Oral>, 2,465
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/ingredient/magnesium_stearate>, 1,405

Top 5 predicates:
<http://www.w3.org/2002/07/owl#sameAs>, 31,929
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/inactiveIngredient>, 28,403
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>, 14,934
<http://www.w3.org/2000/01/rdf-schema#label>, 10,596
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/possibleDiseaseTarget>, 6,124
<http://www4.wiwiss.fu-berlin.de/dailymed/resource/dailymed/routeOfAdministration>, 4,308

Page 14: Quantifying RDF data sets

Building a co-author network from VIVO

with a twist

Page 15: Quantifying RDF data sets

VIVO ontology modeling of authorship

The twist is to include only members of the Reach site

Page 16: Quantifying RDF data sets

Graph processing and extraction

• Follow
– Multiple linked steps are allowed
• Collapse parallel edges
– Add weight to edges based on counts
• Export
– Standard graph format like GraphML, an XML format for graph exchange
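The follow, collapse, and export steps can be sketched with the Python standard library alone. This is an illustrative sketch, not the Py-Triple-Simple implementation. It assumes that a person points to an Authorship node via `authorInAuthorship` and that a publication points to the same node via `informationResourceInAuthorship`; the direction of both predicates, the `ex:` URIs, and the function names are assumptions. The optional `members` filter stands in for the restriction to members of the Reach site.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET

CORE = "http://vivoweb.org/ontology/core#"
AUTHOR_IN = "<" + CORE + "authorInAuthorship>"
PAPER_IN = "<" + CORE + "informationResourceInAuthorship>"

def coauthor_weights(triples, members=None):
    """Follow person -> authorship <- paper, then collapse to weighted pairs."""
    author_of, paper_of = {}, {}  # both keyed by the authorship node
    for s, p, o in triples:
        if p == AUTHOR_IN:
            author_of[o] = s
        elif p == PAPER_IN:
            paper_of[o] = s
    authors_by_paper = {}
    for authorship, paper in paper_of.items():
        person = author_of.get(authorship)
        if person is not None and (members is None or person in members):
            authors_by_paper.setdefault(paper, set()).add(person)
    weights = Counter()  # parallel co-author edges collapse into one count
    for authors in authors_by_paper.values():
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    return weights

def to_graphml(weights):
    """Serialize the weighted, undirected graph as a GraphML string."""
    root = ET.Element("graphml", xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "int"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for node in sorted({n for pair in weights for n in pair}):
        ET.SubElement(graph, "node", id=node)
    for (a, b), w in sorted(weights.items()):
        edge = ET.SubElement(graph, "edge", source=a, target=b)
        ET.SubElement(edge, "data", key="weight").text = str(w)
    return ET.tostring(root, encoding="unicode")

# Hypothetical data: alice and bob share two papers, carol joins the second.
triples = [
    ("<ex:alice>", AUTHOR_IN, "<ex:auth1>"), ("<ex:paper1>", PAPER_IN, "<ex:auth1>"),
    ("<ex:bob>", AUTHOR_IN, "<ex:auth2>"), ("<ex:paper1>", PAPER_IN, "<ex:auth2>"),
    ("<ex:alice>", AUTHOR_IN, "<ex:auth3>"), ("<ex:paper2>", PAPER_IN, "<ex:auth3>"),
    ("<ex:bob>", AUTHOR_IN, "<ex:auth4>"), ("<ex:paper2>", PAPER_IN, "<ex:auth4>"),
    ("<ex:carol>", AUTHOR_IN, "<ex:auth5>"), ("<ex:paper2>", PAPER_IN, "<ex:auth5>"),
]
weights = coauthor_weights(triples)
graphml = to_graphml(weights)
print(graphml)
```

The resulting file has at most one edge per pair of authors, so it can be loaded by tools that expect a simple graph, with the number of shared papers available as the edge weight.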

Page 17: Quantifying RDF data sets

Network analysis with NetworkX

Page 18: Quantifying RDF data sets

Network analysis with Mathematica

Page 19: Quantifying RDF data sets

Network visualization with Gephi

Page 20: Quantifying RDF data sets

For Your Information
- Linked CT: http://queens.db.toronto.edu/~oktie/linkedct/
- BioGrid in BioPax: http://www.pathwaycommons.org/pc-snapshot/current-release/biopax/by_source/
- DrugBank: http://www4.wiwiss.fu-berlin.de/drugbank/drugbank_dump.nt
- DailyMed: http://www4.wiwiss.fu-berlin.de/dailymed/dailymed_dump.nt
- RxNorm is available at: http://link.informatics.stonybrook.edu/rxnorm/
- Reach VIVO site is at: http://reach.sunysb.edu
  SPARQL endpoint: http://link.informatics.stonybrook.edu/sparql/ (named graph http://reach.sunysb.edu)

Page 21: Quantifying RDF data sets

The End

http://ctsaconnect.org/