CON9611 Capture, Organize, and Analyze Big Data for Research
TRANSCRIPT
-
8/18/2019 CON9611 Capture, Organize, And Analyze Big Data for Research
Copyright © 2011, Oracle and/or its affiliates. All rights reserved.
-
Capture, Organize, and Analyze Big Data for Research
Agenda:
Discovering the undiscovered. Applying analytical methods to Big Data in Space Domains, Salim Ansari, Head of Technical Services, European Space Agency
Big Data & Security in Healthcare, Ron Hutchins, CTO & Associate Vice Provost, Research & Technology, Georgia Tech
Exadata for the exome. Big Data for medics & life scientists, Jerven Bolleman, Scientist, Swiss Bioinformatics Institute
Supporting Researchers and the Large Hadron Collider with Oracle, Tony Cass, Database Services Head, CERN
-
Discovering the undiscovered. Applying analytical methods to Big Data in Space Domains
Salim Ansari
Oracle OpenWorld 2013
-
We have enough data in the archives to keep us busy for the next 20 years
-
However, we will need … years to analyze all the data if we continue to use conventional analytics
-
If we had one drop of inspiration in the science domains of what pours into the business world
-
The case of the Astronomical Archives
Based on outdated technologies
Have very little intuitive intelligence
Do not have basic analytical capabilities
Based on basic paradigms (search object, search by attribute)
-
What if...
We created a powerful analytical engine to apply to the data?
We revolutionized the way in which we searched for data in the archives?
Applied marketing methods to science data?
Created the domain of Science Intelligence...??
-
The Amazon.com example
Select an item
Be offered with s… interesting options
-
Multiwavelength Astronomy
-
The Oracle Challenge
Bring inspiration back by using me… to stimulate the mind
Regular updates from the archives remind you of what you are working on
Push intelligence-induced messages about interesting topics to help focus attention on the problem at hand.
-
Big Data & Security in Healthcare
Ron Hutchins, CTO, Georgia Institute of Technology
-
Quick and dirty outline
• Motivation - why big data in healthcare
– Personalized medicine - each of us is different.
• Issues around big data -
– Policy and security and morality
• Possible aids to solving issues.
-
Summary of Health Care in the US
• “Despite its high cost, the US healthcare system produces relatively short life spans, and is wasteful, inefficient and has serious safety and quality issues. .... our approaches to care delivery and financial incentives were designed for a bygone era. Beyond that the technology offered to practitioners has often been overly expensive, poorly designed, overly proprietary, hard to implement and difficult to use. Spurred by a unique, one-time Federal stimulus and the new mobile, wireless and cloud technologies now available, this landscape is rapidly changing. … need to understand the new driving forces and have a basic understanding of contemporary …”
-
Watson is Big Data for diagnosing - depends on good data!
• Diagnostic capabilities based on previous diagnosis and outcomes, clinical knowledge, existing molecular and genomic data and vast repository of cancer case histories, in order to create an outcome and evidence-based decision support system.
….ibm.com/press/us/en/pressrelease/37235.wss
• WellPoint trained Watson with 18,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management.
….ibm.com/innovation/us/watson/pdf/WellPoint_Case_Study_IM…
-
… term view - collecting your personal data in the cloud
• Jawbone UP
• Fitbit
• Nike Fuelband
• Basis
• Lark
• RunKeeper
• MapMyFitness
-
Chronic Disease Management
• Changes in healthcare laws (meaningful use) change how providers are paid and ultimately impact our ability to manage overall healthcare costs
• Application of analytics, mobility, process improvement, … healthcare is a hot area with respect to research, commercialization, and education - it is also inspiring students
… But:
• Health record sells on the black market for $50, SSN for …, raising the stakes for fraud.
• Why have we not been hacked? We're just lucky
-
The Problems with Data
• Much of the data today is falling on the floor after being … we don't know how to store it or use it cost effectively
• Data is messy - many current data standards are prop… best in practice and those on the horizon are cumbersome
• Anonymization/deidentification helps some for research …
• … is disruptive to industry incumbents
• Pipelines for data analysis/research vary greatly, and mo… starts without knowing what specific data is needed
• We live in a mobile/remote access world, and space is at a premium.
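The anonymization/deidentification bullet above can be illustrated with a minimal sketch. It is not from the talk: the field names, the list of direct identifiers, and the secret key are all hypothetical, and a keyed hash is strictly pseudonymization, which leaves quasi-identifiers (dates, ZIP codes, rare diagnoses) untouched. That is one reason it only "helps some".

```python
import hashlib
import hmac

# Hypothetical secret held by the data custodian, never shared with researchers.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

# Fields treated as direct identifiers in this sketch (an assumption, not a standard list).
DIRECT_IDENTIFIERS = {"name", "ssn", "address"}


def pseudonymize(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed hash (HMAC-SHA256) of the patient identifier."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Drop direct identifiers and replace patient_id with a pseudonym."""
    cleaned = {
        k: v
        for k, v in record.items()
        if k not in DIRECT_IDENTIFIERS and k != "patient_id"
    }
    cleaned["pseudonym"] = pseudonymize(record["patient_id"], key)
    return cleaned


# Made-up example record.
record = {
    "patient_id": "P-001",
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "diagnosis": "asthma",
}
print(deidentify(record))
```

Because the hash is keyed, the same patient maps to the same pseudonym across datasets held by one custodian, while someone without the key cannot recompute it from a known identifier.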
-
What are research issues today?
• how can we architect systems and networks to efficiently scale given the constraints
– Networks - private networks
– Security - privacy and protection
– Identity management - access control, auth-z
– Computation - how do we analyze/mine …
– Storage and databases - scaling and securing
– Tools and services (monitoring/management/alerting … auditing access)
-
So… your personal data in the cloud
• According to an NIH assistant director the greatest risk of data … healthcare is unencrypted data on an unlocked laptop that gets …
• Multi-tenancy - hosting multiple datasets on the same hardware … “complicated”
• Multi-use of a dataset - sharing the same dataset across multiple projects is against policy at times
• Mash-ups - combining protected data with other datasets e.g. … against some policy
• Policy is different based on who creates the policy - are there r… confusion and human error in data hosting
• Cloud based apps - data loss and exposure
• e.g. Dropbox breach
-
How are we securing the data: worst …
HIPAA is all about protecting the identity of the patient - but th… about protecting against fraud and reputation
• 2 factor physical access to a controlled environment dedicated …
• 2 factor data access to a controlled and dedicated set of resources … network access beyond resources dedicated to the project
• Data (including derivative data) must be encrypted at rest and … read-only to researchers, compliant destruction, oversight of the results, and not stored or mixed with data sets which are not explicitly allowed
• Training, IRB and data source specific Data Management Plan … procedures, logging, and auditing of access and activity - thi… researcher and system/network administration.
-
And then there is…
• Moral issues around big data
• NSA has caused a stir…
• So, what about meta-data? Is it important?
• Do we protect meta-data the same way as personal data?
-
Exadata for the exome
Big Data for medics & life scientists
Jerven Bolleman
Developer
SIB Swiss Institute of Bioinformatics
-
© 2013 SIB Swiss Institute of Bioinformatics
Oracle OpenWorld 2013
• EQ groups
• Q@0 collaborators
-
Strategic Goal
To provide key competencies & research support to the life science community
-
volume: 3 PB storage
velocity: 24 TB per week
variety: 100's of sources
veracity: experimental uncertainty
value
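To put the volume and velocity figures side by side, here is a small back-of-the-envelope calculation. It assumes decimal units (1 PB = 1000 TB) and a constant ingest rate; neither assumption is stated on the slide.

```python
# Figures from the slide; decimal units (1 PB = 1000 TB) are an assumption.
STORAGE_TB = 3 * 1000        # 3 PB expressed in TB
INGEST_TB_PER_WEEK = 24      # 24 TB per week

SECONDS_PER_WEEK = 7 * 24 * 3600

# Weeks of ingest at this rate that add up to the stated storage volume.
weeks_to_fill = STORAGE_TB / INGEST_TB_PER_WEEK

# Average sustained ingest rate in MB/s (1 TB = 1_000_000 MB).
avg_mb_per_s = INGEST_TB_PER_WEEK * 1_000_000 / SECONDS_PER_WEEK

print(f"{weeks_to_fill:.0f} weeks, {avg_mb_per_s:.1f} MB/s")
```

Under these assumptions the weekly inflow corresponds to roughly 40 MB/s sustained, and 125 weeks of such inflow add up to the 3 PB figure.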
-
Exadata meets Exome
DNA genes: Menu in the restaurant
RNA: What the waiter writes down
Proteins: The meal delivered
Exome: List of all ordered meals
-
M Murtaza et al. Nature 000, 1-8 (2013) doi:10.1038/nature120…
-
A@0% of information
-
A@0% of information per person
-
@A5?
-
@A5?
• Secure
– access
– backups
• Upgradable
• Maintainable
• Fast data ingestion
• Works within the week
• SQL/SPARQL
– low developer cost
– set based analysis
– Query language
– good for questions
• Adaptable
– New question
– New query
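The "set based analysis" and "New question, New query" points can be sketched with plain SQL. The example below uses Python's built-in sqlite3 module as a stand-in (not Exadata, and not SPARQL), and the table layout and data are invented purely for illustration.

```python
import sqlite3

# In-memory database with a hypothetical schema: one row per variant observed in a sample.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variant (sample_id TEXT, gene TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO variant VALUES (?, ?, ?)",
    [
        ("s1", "BRCA1", 100),
        ("s1", "TP53", 200),
        ("s2", "BRCA1", 100),
        ("s3", "TP53", 250),
        ("s3", "BRCA1", 120),
    ],
)

# Question 1: in how many distinct samples does each gene carry a variant?
per_gene = conn.execute(
    "SELECT gene, COUNT(DISTINCT sample_id) FROM variant GROUP BY gene ORDER BY gene"
).fetchall()

# A new question only needs a new query: which positions recur in more than one sample?
recurrent = conn.execute(
    "SELECT gene, position FROM variant "
    "GROUP BY gene, position HAVING COUNT(DISTINCT sample_id) > 1"
).fetchall()

print(per_gene)
print(recurrent)
```

Both questions are answered over the whole set of rows at once, with no per-sample loop in application code, which is the sense in which a query language keeps developer effort low.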
-
dmitry.kuznetsov@isb-sib.ch
jervenbolleman
jerven.bolleman@isb-sib.ch
This is Dmitry, who did the hard work
-
Supporting Researchers and the Large Hadron Collider with Oracle
Tony Cass
Head of Database Services
-
Oracle Parallel Query vs Hadoop
-
Oracle Parallel Query vs Hadoop
Graphics courtesy of Zbigniew Baranowski
-
Exadata tests
With HCC I could run the benchmark analysis on 2… 210 seconds, with I/O reads up to …10 …B/s
• 27… faster using Hybrid Columnar Compression
Hybrid Columnar Compression … my data volume by a factor …
• HCC made electron-selection …
• HCC made jet-selection … due to external function used
Figure: Query activity without HCC and with HCC (scan (electron)-table, scan (jet)-table)
Maybe another Exadata t…
• with more complex a…
• In-memory columnar …
Slide courtesy of Maaike Limper
-
Hadoop vs Oracle
Hadoop version of Z+H benchmark analysis:
• Physics-data stored as comma-delimited text-files in Hadoop filesystem
• Reproduce Z+H benchmark analysis with MapReduce-code (Java)
• Mappers: one mapper per object to select muon, electron etc.
• Reduce: select events with 2 good leptons and 2 b-jets, calculate invariant mass …
Hadoop:
57 seconds
limited by …
CERN openlab major review September
Slide courtesy of Maaike Limper
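The map and reduce steps listed above can be mimicked in a few lines. This is only an in-process Python sketch, not the Java MapReduce code the slide refers to. The record layout, the transverse-momentum cut, and the choice to compute the invariant mass of the lepton pair are all assumptions, since the slide text is cut off at that point. A single mapper function stands in for the per-object mappers.

```python
import math
from collections import defaultdict

# Hypothetical comma-delimited layout: event_id,object_type,E,px,py,pz
LINES = [
    "1,muon,50.0,30.0,0.0,40.0",
    "1,muon,50.0,-30.0,0.0,-40.0",
    "1,bjet,60.0,0.0,60.0,0.0",
    "1,bjet,45.0,0.0,-45.0,0.0",
    "2,electron,40.0,40.0,0.0,0.0",
    "2,bjet,30.0,0.0,30.0,0.0",
]

LEPTONS = {"muon", "electron"}
MIN_PT = 20.0  # illustrative cut, not taken from the slides


def mapper(line):
    """Parse one object and emit (event_id, (type, four_vector)) if it passes the cut."""
    event_id, kind, *vec = line.split(",")
    e, px, py, pz = map(float, vec)
    if math.hypot(px, py) >= MIN_PT:
        yield int(event_id), (kind, (e, px, py, pz))


def reducer(event_id, objects):
    """Keep events with exactly 2 leptons and 2 b-jets; return the di-lepton invariant mass."""
    leptons = [v for k, v in objects if k in LEPTONS]
    bjets = [v for k, v in objects if k == "bjet"]
    if len(leptons) != 2 or len(bjets) != 2:
        return None
    # Sum the two lepton four-vectors component by component.
    e, px, py, pz = (a + b for a, b in zip(*leptons))
    return event_id, math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def run(lines):
    """Map every line, group by event_id (the shuffle step), then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    results = (reducer(k, v) for k, v in sorted(grouped.items()))
    return [r for r in results if r is not None]


print(run(LINES))
```

In a real Hadoop job the grouping by event is done by the framework between the map and reduce phases; here a dictionary plays that role.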
-
Hadoop vs Oracle
I/O read speed comparison for the Z+H benchmark
Hadoop:
up to 5844 MB/s
Oracle:
up to …
CERN openlab major review September
Slide courtesy of Maaike Limper
-
Comparison with root
Run root file-based analysis in parallel on all 6 nodes
12… ntuples copied and divided to the 8 x 12 disks in our cluster
Use a set of scripts to run the ntuple-analysis on all nodes
• Copy root-macro to each node
• 12 root-jobs per node simultaneously, each job reads from …
• Merge histogram produced by sub-jobs to obtain final result
• like a miniature-version of running jobs on the grid
root-ntuple analysis:
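The pattern in the bullets above (run independent sub-jobs in parallel, then merge their histograms) can be sketched without root. In this hypothetical example, Python threads stand in for the per-node root jobs, in-memory lists stand in for ntuples on separate disks, and the binning is invented.

```python
from concurrent.futures import ThreadPoolExecutor

N_BINS = 5
BIN_WIDTH = 20.0  # bins cover [0, 100); values outside are ignored


def fill_histogram(values):
    """Sub-job: histogram one chunk of values (stand-in for one root job on one ntuple)."""
    hist = [0] * N_BINS
    for v in values:
        idx = int(v // BIN_WIDTH)
        if 0 <= idx < N_BINS:
            hist[idx] += 1
    return hist


def merge(histograms):
    """Merge step: add the sub-job histograms bin by bin."""
    return [sum(bins) for bins in zip(*histograms)]


# Hypothetical chunks, one per sub-job; the real setup reads ntuples from separate disks.
chunks = [[5.0, 25.0, 95.0], [15.0, 45.0], [85.0, 99.0, 120.0]]

with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(fill_histogram, chunks))

total = merge(partial)
print(total)
```

Because histogram filling is additive, the merged result does not depend on how the input is split across sub-jobs, which is what makes this kind of analysis easy to parallelize.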
-