symplectic.co.uk VIVO ISF: Investigating Speed Factors Graham Triggs Head of Repository Systems graham@symplectic.co.uk @grahamtriggs

Upload: antonia-howard

Post on 01-Jan-2016


Page 1: VIVO ISF: Investigating Speed Factors

symplectic.co.uk

Graham Triggs
Head of Repository Systems

graham@symplectic.co.uk

@grahamtriggs

Page 2: About the title..

This is not VIVO-ISF versus pre-ISF

Page 3: This is..

Practical use of VIVO 1.8

Challenges encountered

Solutions and suggestions

Page 4: Loading Data

Page 5: Datasets

                      Demo      Client #1   Client #2
Users                 136       27,489      5,544
External Co-authors   ~46,000   ~120,000    ~140,000
Articles              ~36,000   ~110,000    ~150,000

Events: ~8,000

Asserted Triples: 6,683,071; 12,372,999

Inferred Triples: 6,848,955; 12,236,798

Total Triples: 13,532,026; 24,609,797
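The triple counts on this slide can be cross-checked with a few lines of Python. This is only a sketch: the figures are copied from the slide, and because the transcript does not say which dataset columns the two sets of counts belong to, they are labelled by position only.

```python
# Triple counts copied from the Datasets slide. The transcript does not
# show which datasets they belong to, so they are kept in slide order.
asserted = [6_683_071, 12_372_999]
inferred = [6_848_955, 12_236_798]
reported_total = [13_532_026, 24_609_797]

for a, i, t in zip(asserted, inferred, reported_total):
    total = a + i
    share = i / total  # fraction of all triples that came from inferencing
    print(f"asserted={a:,} inferred={i:,} total={total:,} "
          f"matches_slide={total == t} inferred_share={share:.1%}")
```

In both cases the sum matches the reported total, and roughly half of the stored triples are inferred rather than asserted.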

Page 6: Demo Server

r3.large - optimized for memory-intensive applications
• 2 vCPU (Intel Xeon E5-2670 v2 Ivy Bridge)
• 15.25 GiB memory
• 32 GB SSD instance storage
• added 50 GB SSD general purpose (gp2) storage

Page 7: IO Problems

24 hours – data still not loaded

Unreserved SSD = limited IO by size

Small disks = low IO

(AWS gp2 = max 128 MiB/s rising to 160 MiB/s; 3 IOPS per GiB)

4000 IOPS provisioning max – at $0.065 per IOPS-month ($260)

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
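The arithmetic behind these figures can be sketched as follows. The rates are the ones quoted on the slide; the helper names are made up for illustration, and the calculation ignores any other gp2 rules (minimums, bursting) that the slide does not mention.

```python
# Rates quoted on the slide; function names are illustrative only.
GP2_IOPS_PER_GIB = 3             # gp2 baseline IOPS per GiB of volume size
PIOPS_PRICE_PER_MONTH = 0.065    # USD per provisioned IOPS-month

def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume of the given size."""
    return size_gib * GP2_IOPS_PER_GIB

def provisioned_iops_cost(iops: int) -> float:
    """Monthly cost in USD of the IOPS alone (storage is billed separately)."""
    return round(iops * PIOPS_PRICE_PER_MONTH, 2)

print(f"50 GiB gp2 volume: {gp2_baseline_iops(50)} baseline IOPS")
print(f"4000 provisioned IOPS: ${provisioned_iops_cost(4000):,.2f} per month")
```

On these numbers, the 50 GB gp2 volume added to the demo server works out at about 150 baseline IOPS, which is in the same range as the 155 read IOPS that fio measured on the AWS VM.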

Page 8: New Server

• Amazon EBS Provisioned IOPS (SSD) volumes
  • $0.125 per GB-month of provisioned storage
  • $0.065 per provisioned IOPS-month
• EX40-SSD
  • 32 GB RAM, 2x240 SSD, i7-4770
  • ~60 euros
  • Load time: ~3 hours (plus inferencing / indexing)

Page 9: IO Comparison

fio               AWS VM     Dedicated
Read IOPS         155        91937
Read Bandwidth    636KB/s    367.7MB/s
Write IOPS        23         11345
Write Bandwidth   96KB/s     45.3MB/s
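To put the fio table in proportion, the sketch below computes the IOPS ratio between the two machines and the bandwidth per operation. The numbers are from the slide; the conversion of MB/s to KB/s assumes 1 MB = 1000 KB, which the slide does not state, and the fio job settings themselves are not given in the talk.

```python
# (IOPS, bandwidth in KB/s) per host, from the IO Comparison slide.
# MB/s values are converted assuming 1 MB = 1000 KB (an assumption).
FIO = {
    "read":  {"AWS VM": (155, 636.0), "Dedicated": (91_937, 367_700.0)},
    "write": {"AWS VM": (23, 96.0),   "Dedicated": (11_345, 45_300.0)},
}

def speedup(op: str) -> float:
    """How many times more IOPS the dedicated server delivered."""
    return FIO[op]["Dedicated"][0] / FIO[op]["AWS VM"][0]

def kb_per_op(op: str, host: str) -> float:
    """Bandwidth divided by IOPS: the implied size of each operation."""
    iops, kb_per_s = FIO[op][host]
    return kb_per_s / iops

for op in FIO:
    print(f"{op}: dedicated ~{speedup(op):.0f}x the AWS IOPS")
    for host in FIO[op]:
        print(f"  {host}: ~{kb_per_op(op, host):.1f} KB per operation")
```

The dedicated server comes out several hundred times faster on both reads and writes, and every row implies about 4 KB per operation, which would be consistent with a small-block random IO test.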

Page 10: MySQL – Demo Dataset

2.0 GB RDF/XML

3.2 GB MySQL database (pre inference)

6.1 GB MySQL database (post inference)

Transfer slows dramatically after ~1 GB written

Regains speed after ~2 GB

Page 11: Processing Data

Page 12: Inferencing

Fast (~8-12 ms per individual)

However…

2 million individuals = 6-7 hours

Large datasets still slow down (up to 60 ms per individual)

Memory problems

Suspect IndexListener
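The per-individual timings translate into wall-clock time as in the sketch below, which assumes individuals are processed one after another at a constant rate (the helper name is made up for illustration).

```python
def hours(individuals: int, ms_each: float) -> float:
    """Total time in hours if every individual takes ms_each milliseconds."""
    return individuals * ms_each / 1000 / 3600

n = 2_000_000
for ms in (8, 12, 60):
    print(f"{ms} ms per individual -> {hours(n, ms):.1f} hours")
```

At 12 ms each, 2 million individuals take a little under 7 hours, in line with the 6-7 hours quoted; at 60 ms each the same run would take well over a day.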

Page 13: Triple store performance

Query for graphs
• Co-authorship

Client #1
• SDB – 10 secs
• TDB – 1 sec

Page 14: Simple SQL Queries

Using YourKit profiler to show SQL executed

No evidence of complex queries

Combined predicates, functions appear to be processed in Java

Is performance of TDB down to in-memory vs SQL parsing?

Page 15: MySQL Performance

Total rows: 24,647,663

select g, count(*) from Quads where g IN (-364693509095697557, 786347385076487474) GROUP BY g;

24 seconds

select count(*) from Quads;

14.72 seconds

select count(g) from Quads where g = 786347385076487474;

4.16 seconds
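A small harness for re-running these three queries and timing them might look like the sketch below. It is a stand-in only: it uses an in-memory SQLite database rather than MySQL, a tiny synthetic Quads table, and column names other than g are assumptions, so the timings it prints say nothing about the figures above.

```python
import sqlite3
import time

# Stand-in only: in-memory SQLite instead of MySQL, with synthetic rows.
# Columns s, p and o are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Quads (g INTEGER, s INTEGER, p INTEGER, o INTEGER)")
G1, G2 = -364693509095697557, 786347385076487474
rows = [(G1 if i % 3 else G2, i, i % 7, i * 2) for i in range(30_000)]
conn.executemany("INSERT INTO Quads VALUES (?, ?, ?, ?)", rows)

QUERIES = [
    f"select g, count(*) from Quads where g IN ({G1}, {G2}) GROUP BY g",
    "select count(*) from Quads",
    f"select count(g) from Quads where g = {G2}",
]

def timed(sql: str):
    """Run one query and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    result = conn.execute(sql).fetchall()
    return result, time.perf_counter() - start

for sql in QUERIES:
    result, elapsed = timed(sql)
    print(f"{elapsed * 1000:.2f} ms  {sql}")
```

Pointing the same loop at the real database (with the appropriate driver) would allow the effect of indexes or configuration changes on each query to be compared directly.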

Page 16: Redundant Effort

Co-author graph query executed
• On page access
• On GraphML retrieval

Two queries = twice the effort

When each takes 10 secs rather than 1…

Page 17: Result sets vs Triples

Number of triples not necessarily relevant

Small queries still execute quickly

Amount of data matched by SPARQL important
• This may include parts of the query
• 1 author may have
  • 90 publications
  • 10 investigator roles (grants)

Page 18: So, More Triples?

Would subproperties give simpler queries with fewer results? e.g.

vivo:hasAuthorship
vivo:hasInvestigatorRole

As subproperties of vivo:relates

Parent property can be inferred and available

Should subproperties be used to ease understanding?
vivo:bearerOf vs obo:RO_0000053

(UI hides ontologies with labels, but not from developers)
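The trade-off can be illustrated with a toy triple list in plain Python. This is a sketch under stated assumptions: the data is synthetic (one author with 90 authorships and 10 investigator roles), the property directions and type names are chosen for illustration rather than taken from the ontology, and simple list filtering stands in for a SPARQL engine.

```python
# Synthetic data: one author, 90 authorships, 10 investigator roles.
# Each node is linked by the generic property AND a specific subproperty.
AUTHOR = "ex:author1"
triples = []
for i in range(90):
    node = f"ex:authorship{i}"
    triples += [(AUTHOR, "vivo:relates", node),
                (AUTHOR, "vivo:hasAuthorship", node),
                (node, "rdf:type", "vivo:Authorship")]
for i in range(10):
    node = f"ex:role{i}"
    triples += [(AUTHOR, "vivo:relates", node),
                (AUTHOR, "vivo:hasInvestigatorRole", node),
                (node, "rdf:type", "vivo:InvestigatorRole")]

def objects(subject: str, predicate: str) -> list:
    """All objects of triples matching (subject, predicate, ?o)."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Generic property: 100 candidates are matched, then filtered by type.
candidates = objects(AUTHOR, "vivo:relates")
types = {s: o for s, p, o in triples if p == "rdf:type"}
via_generic = [n for n in candidates if types[n] == "vivo:Authorship"]

# Specific subproperty: only the 90 authorships are matched at all.
via_specific = objects(AUTHOR, "vivo:hasAuthorship")

print(len(candidates), len(via_generic), len(via_specific))
print(len(triples))  # the extra subproperty triples are the storage cost
```

Both approaches return the same 90 authorships, but the generic route touches 100 intermediate matches and needs a type check per match, while the subproperty route costs 100 additional stored triples in this toy dataset.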

Page 19: Thank you!