CON9611 Capture, Organize, and Analyze Big Data for Research

Upload: suchai

Post on 07-Jul-2018


  • 8/18/2019 CON9611 Capture, Organize, And Analyze Big Data for Research


    Copyright © 2011, Oracle and/or its affiliates. All rights reserved.


    Capture, Organize, and Analyze Big Data for Research

    Agenda:

    Discovering the undiscovered. Applying analytical methods to Big Data in Space Domains, Salim Ansari, Head of Technical Services, European Space Agency

    Big Data & Security in Healthcare, Ron Hutchins, CTO & Associate Vice Provost, Research & Technology, Georgia Tech

    Exadata for the exome. Big Data for medics & life scientists, Jerven Bolleman, Scientist, Swiss Bioinformatics Institute

    Supporting Researchers and the Large Hadron Collider with Oracle, Tony Cass, Database Services Head, CERN


    Discovering the undiscovered. Applying analytical methods to Big Data in Space Domains

    Salim Ansari

    Oracle OpenWorld 2013


    We have enough data in the archives to keep […] busy for the next 20 years


    However, we will need […] years to analyze all the data if we continue to use conventional analytic[…]


    If we had one drop of […] inspiration in the scien[…] domains of what pou[…] into the business world


    The case of the Astronomical Archives

    Based on outdated technologies

    Have very little intuitive intelligence

    Do not have basic analytical capabilities

    Based on basic paradigms (search object, search by attribute)


    What if...

    We created a powerful analytical e[…] to apply to the data?

    We revolutionized the way in which […] searched for data in the archives?

    Applied marketing methods to scie[…] data?

    Created the domain of Science Intelligence...?


    The Amazon.com example

    Select an item

    Be offered with s[…] interesting options


    Multiwavelength Astronomy


    The Oracle Challenge

    Bring inspiration back by using me[…] to stimulate the mind

    Regular updates from the archives […] remind you of what you are working […]

    Push intelligence-induced messages about interesting topics to help focus attention on the problem at hand.


    Big Data & Security in Healthcare

    Ron Hutchins, CTO, Georgia Institute of Technology


    Quick and dirty outline

    • Motivation: why big data in healthcare

     – Personalized medicine: each of us is different.

    • Issues around big data

     – Policy and security and morality

    • Possible aids to solving issues.


    Summary of Health Care in the US

    • "Despite its high cost, the US healthcare system produces relatively short life spans, and is wasteful, inefficient and has serious safety and quality issues. ... our approaches to care delivery and financial incentives were designed for a bygone era. Beyond that the technology offered to practitioners has often been overly expensive, poorly designed, overly proprietary, hard to implement and […] to use. Spurred by a unique, one-time Federal stimulus and the new mobile, wireless and cloud technologies now available, this landscape is rapidly changing. … need to understand the new driving […] and have a basic understanding of contemporary […]


    Watson is Big Data for diagnosing: depends on good data

    • Diagnostic capabilities based on previous diagnosis and outcomes […] clinical knowledge, existing molecular and genomic data and vast repository of cancer case histories, in order to create an outcome and evidence-based decision support system.

    […].ibm.com/press/us/en/pressrelease/37235.wss

    • WellPoint trained Watson with 18,000 historical cases. Now Watson uses hypothesis generation and evidence-based learning to generate confidence-scored recommendations that help nurses make decisions about utilization management.

    […].ibm.com/innovation/us/watson/pdf/WellPoint_Case_Study_IM[…]


    […] term view: collecting your personal data in the cloud

    • Jawbone UP

    • Fitbit

    • Nike Fuelband

    • Basis

    • Lark

    • RunKeeper

    • MapMyFitness


    Chronic Disease Management

    • Changes in healthcare laws (meaningful use) change how providers are paid and ultimately impact our ability to manage overall healthcare costs

    • Application of analytics, mobility, process improvement, […] healthcare is a hot area with respect to research, commercialization, and education; it is also inspiring st[…]

    … But…

    • Health record sells on the black market for $50, SSN for […] raising the stakes for fraud.

    • Why have we not been hacked? We're just lucky


    The Problems with Data

    • Much of the data today is falling on the floor after being […] we don't know how to store it or use it cost effectively

    • Data is messy: many current data standards are prop[…] best in practice and those on the horizon are cumbersome

    • Anonymization/deidentification helps some for research […] is disruptive to industry incumbents

    • Pipelines for data analysis/research vary greatly, and mo[…] starts without knowing what specific data is needed

    • We live in a mobile/remote access world, and space is at a premium.
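
    The anonymization/deidentification bullet above can be illustrated with a minimal sketch. It assumes that replacing direct identifiers with a keyed hash (HMAC-SHA256) is acceptable for the research use at hand; the field names, the record layout, and the `pseudonymize` helper are hypothetical and do not come from the talk. Real de-identification also has to handle quasi-identifiers such as dates and ZIP codes, which this sketch ignores.

```python
import hashlib
import hmac

# Hypothetical list of direct identifiers to replace; not taken from the talk.
DIRECT_IDENTIFIERS = ("patient_id", "name", "ssn")


def pseudonymize(record, secret_key, fields=DIRECT_IDENTIFIERS):
    """Return a copy of `record` with direct identifiers replaced by keyed hashes."""
    cleaned = dict(record)
    for field in fields:
        if field in cleaned:
            digest = hmac.new(
                secret_key,
                str(cleaned[field]).encode("utf-8"),
                hashlib.sha256,
            ).hexdigest()
            # Shortened pseudonym: stable for a given key, unlinkable without it.
            cleaned[field] = digest[:16]
    return cleaned


if __name__ == "__main__":
    key = b"replace-with-a-secret-kept-outside-the-dataset"
    visit = {"patient_id": "12345", "name": "Jane Doe", "diagnosis": "asthma"}
    print(pseudonymize(visit, key))
```

    With the same key, the same patient always maps to the same pseudonym, so records can still be joined for research; with a different key per project, datasets cannot be linked to each other.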


    What are research issues today

    • how can we architect systems and networks […] efficiently scale given the constraints

     – Networks: private networks

     – Security: privacy and protection

     – Identity management: access control, auth-z

     – Computation: how do we analyze/mine […]

     – Storage and databases: scaling and securing

     – Tools and services: monitoring/management/alerting […] auditing access


    So… your personal data in the cloud

    • According to an NIH assistant director the greatest risk of data […] healthcare is unencrypted data on an unlocked laptop that gets […]

    • Multi-tenancy: hosting multiple datasets on the same hardware […] "complicated"

    • Multi-use of a dataset: sharing the same dataset across multiple projects is against policy at times

    • Mash-ups: combining protected data with other datasets e.g. […] against some policy

    • Policy is different based on who creates the policy: are there r[…] confusion and human error in data hosting

    • Cloud based apps: data loss and exposure

    • e.g. Dropbox breach


    How are we securing the data… worst […]

    HIPAA is all about protecting the identity of the patient, but th[…] about protecting against fraud and reputation

    • 2 factor physical access to a controlled environment dedicated […]

    • 2 factor data access to a controlled and dedicated set of resources […] network access beyond resources dedicated to the project

    • Data (including derivative data) must be encrypted at rest and […] read-only to researchers, compliant destruction, oversight of […] the results, and not stored or mixed with data sets which are […] explicitly allowed

    • Training, IRB and data source specific Data Management Plan […] procedures, logging, and auditing of access and activity; this […] researcher and system/network administration.

    And then there is…

    • Moral issues around big data

    • NSA has caused a stir…

    • So, what about meta-data? Is it important?

    • Do we protect meta-data the same way as personal data?


    Exadata for the exome

    Big Data for medics & life scientists

    Jerven Bolleman

    Developer

    SIB Swiss Institute of Bioinformatics


    © 2013 SIB Swiss Institute of Bioinformatics, Oracle OpenWorld 2013

    • EQ groups

    • Q@0 collaborators


    Strategic Goal

    To provide key compe[…] & research support to the […] life science community


    volume: 3 PB storage

    velocity: 24 TB per week

    variety: 100's of sources

    veracity: experimental uncertainty

    Value


    Exadata meets Exome

    DNA genes: Menu in the restaurant

    RNA: What the waiter writes down

    Proteins: The meal delivered

    Exome: List of all ordered meals


    M. Murtaza et al., Nature 000, 1-[…] (2013), doi:10.1038/nature120[…]


    A@0% of information


    A@0% of information per person


    VALUE

    • Secure

     – access

     – backups

    • Upgradable

    • Maintainable

    • Fast data ingestion

    • Works within the wee[…]

    • SQL/SPARQL

     – low developer c[…]

     – set based analy[…]

     – Query language

     – good for ques[…]

    • Adaptable

     – New question

     – New query

    dmitry.kuznetsov@isb-sib.ch

    jervenbolleman

    jerven.bolleman@isb-sib.ch

    This is Dmitry, who did the hard work


    Supporting Researchers and the Large Hadron Collider with Oracle

    Tony Cass

    Head of Database Services

    Oracle Parallel Query vs Hadoop

    Graphics courtesy of Zbigniew Baranowski


    Exadata test

    With HCC I could run the benchmark analysis on 2[…] 210 seconds, with I/O reads up to […]10 […]B/s

    • 27 faster using Hybrid Columnar Compression

    Hybrid Columnar Compression […] my data volume by a factor […]

    • HCC made electron-selection […]

    • HCC made jet-selection […] due to external function used

    [Figure: query activity without HCC and with HCC; phases labelled "scan electron-table" and "scan jet-table"]

    Maybe another Exadata t[…]

    • with more complex a[…]

    • In-memory columnar

    Slide courtesy of Maaike Limper


    Hadoop vs Oracle

    Hadoop version of the Z+H benchmark analysis:

    • Physics-data stored as comma-delimited text-files in Hadoop filesystem

    • Reproduce Z+H benchmark analysis with MapReduce-code (Java)

    • Mappers: one mapper per object to select muon, electron etc.

    • Reduce: select events with 2 good leptons and 2 b-jets, calculate inva[…]

    Hadoop: 57 seconds, limited by […]

    CERN openlab major review September

    Slide courtesy of Maaike Limper
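
    The map and reduce steps listed on this slide can be sketched in a few lines. This is an illustration only, written in Python rather than the Java MapReduce code the slide mentions, and it runs in a single process instead of on Hadoop. The comma-delimited record layout (`event_id,kind,pt,eta,phi,mass,is_good`), the `bjet` label, and all function names are assumptions. The truncated "calculate inva[…]" is read here as an invariant-mass calculation, which is also an assumption, and a single mapper handles every object type for brevity.

```python
import math
from collections import defaultdict

LEPTONS = ("muon", "electron")


def four_vector(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return (energy, px, py, pz)


def invariant_mass(vectors):
    """Invariant mass of the summed four-vectors."""
    energy, px, py, pz = (sum(component) for component in zip(*vectors))
    return math.sqrt(max(energy ** 2 - px ** 2 - py ** 2 - pz ** 2, 0.0))


def mapper(line):
    """Map step: emit (event_id, object) for objects passing the quality flag.

    Assumed line layout: event_id,kind,pt,eta,phi,mass,is_good
    """
    event_id, kind, pt, eta, phi, mass, is_good = line.strip().split(",")
    if is_good == "1":
        vec = four_vector(float(pt), float(eta), float(phi), float(mass))
        yield int(event_id), (kind, vec)


def reducer(event_id, objects):
    """Reduce step: keep events with exactly 2 good leptons and 2 b-jets."""
    leptons = [vec for kind, vec in objects if kind in LEPTONS]
    bjets = [vec for kind, vec in objects if kind == "bjet"]
    if len(leptons) == 2 and len(bjets) == 2:
        return (event_id, invariant_mass(leptons), invariant_mass(bjets))
    return None


def run_job(lines):
    """Group mapper output by event id (the shuffle), then reduce each event."""
    grouped = defaultdict(list)
    for line in lines:
        for event_id, obj in mapper(line):
            grouped[event_id].append(obj)
    selected = []
    for event_id in sorted(grouped):
        result = reducer(event_id, grouped[event_id])
        if result is not None:
            selected.append(result)
    return selected
```

    Calling `run_job` on an iterable of text lines returns one `(event_id, dilepton_mass, dijet_mass)` tuple per selected event.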


    Hadoop vs Oracle

    I/O read speed comparison for the H[…] benchmark

    Hadoop: up to 5844 MB/s

    Oracle: up […]

    CERN openlab major review September

    Slide courtesy of Maaike Limper


    Comparison with root

    root-ntuple analysis:

    Run root file-based analysis in parallel on all 6 nodes

    12[…] ntuples copied and divided to the 8 x 12 disks in our cluster

    Use a set of scripts to run the ntuple-analysis on all nodes

    • Copy root-macro to each node

    • 12 root-jobs per node simultaneously, each job reads from […]

    • Merge histogram produced by sub-jobs to obtain final result

    • like a miniature-version of running jobs on the grid
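
    The "run sub-jobs, then merge their histograms" pattern from this slide can be sketched as follows. This is not ROOT code: plain Python lists stand in for ntuples and histograms, and a thread pool stands in for the 12 simultaneous root-jobs per node. All function names and the binning parameters are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fill_histogram(values, n_bins, low, high):
    """Stand-in for one analysis job: bin the values of one ntuple."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        if low <= value < high:
            # min() guards against floating-point rounding at the upper edge.
            index = min(int((value - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


def merge_histograms(histograms):
    """Merge step: add the partial histograms bin by bin."""
    return [sum(bin_counts) for bin_counts in zip(*histograms)]


def run_parallel(ntuples, n_bins, low, high, jobs=12):
    """Run one job per ntuple, `jobs` at a time, and merge the results."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        partial = list(
            pool.map(lambda values: fill_histogram(values, n_bins, low, high), ntuples)
        )
    return merge_histograms(partial)
```

    Because histogram filling is additive, the merged result is identical to filling a single histogram from all ntuples at once, which is what makes this split-and-merge approach safe.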
