TRANSCRIPT
Fran Berman, Data and Society, CSCI 4967/6963
Data and Society Lecture 3: Data and Computing
2/13/15
Announcements
• All students: send 2 sample essay exam questions on Lectures 1, 2, 3 material to [email protected] by February 18
• Next week's guest speaker is Colin Bodell, CTO of Time Inc. and former head of the Kindle technology team at Amazon
– Please attend class and be prepared to ask Colin many questions
Today (2/13/15)
• Any questions about Lecture 2?
• Lecture 3: Data and Computing
– Some history
– IT-enabled Applications
– Cyberinfrastructure
– Building Data Cyberinfrastructure at SDSC
• Break
• Data Roundtable (Miguel, Alex, Yusri, Robert, Lars, Oskari)
Course schedule (date: first "half" / second "half")

Section 1: The Data Ecosystem -- Fundamentals
• January 30: Class introduction; Digital data in the 21st Century (L1) / Data Roundtable (Fran)
• February 6: Data Stewardship and Preservation (L2) / L1 Data Roundtable (5 students)
• February 13: Data and Computing (L3) / L2 Data Roundtable (6 students)  [We are here]
• February 20: Colin Bodell, Time Inc. CTO Guest Lecture and Q&A / L3 Data Roundtable (5 students)

Section 2: Data and Innovation – How has data transformed science and society?
• February 27: Section 1 Exam / Data and the Health Sciences (L4)
• March 6: Paper preparation / no class
• March 13: Data and Entertainment (L5) / L4 Data Roundtable (6 students)
• March 20: Big Data Applications (L6) / L5 Data Roundtable (5 students)

Section 3: Data and Community – Social infrastructure for a data-driven world
• April 3: Data in the Global Landscape (L7); Section 2 paper due / L6 Data Roundtable (6 students)
• April 10: Bulent Yener Guest Lecture, Data Privacy / Bad Guys on the Internet (L8) / L7 Data Roundtable (5 students)
• April 17: Data and the Workforce (L9) / L8 Data Roundtable (6 students)
• April 24: Mike Schroepfer, Facebook CTO Guest Lecture and Q&A
• May 1: Data Futures (L10) / L9 Data Roundtable (5 students)
• May 8: Section 3 Exam / L10 Data Roundtable (5 students)
Lecture 3: Data and Computing
Computational modeling, simulation, and analysis are critical tools in addressing science and societal challenges, for example:
• What is the potential impact of global warming?
• How will natural disasters affect urban centers?
• What therapies can be used to cure or control cancer?
• Can we accurately predict market outcomes?
• What plants work best for biofuels?
• Is there life on other planets?
Computational science: an increasing focus in the '80s and '90s (data issues often in the background …)
• Many reports in 80’s and early 90’s focused on the potential of information technologies (primarily computers and high-speed networks) to address key scientific and societal challenges
• First federal “Blue Book” in 1992 focused on key computational problems including
– Weather forecasting
– Cancer genes
– Predicting new superconductors
– Aerospace vehicle design
– Air pollution
– Energy conservation and turbulent combustion
– Microsystems design and packaging
– Earth's biosphere
– Broader education resources
Enabling IT: Increasing focus on a broader spectrum of resources
[Diagram: application classes arranged along three axes of enabling IT – COMPUTE (more FLOPS), DATA (more BYTES), and NETWORK (more bandwidth):
• Home, lab, campus, desktop applications
• Compute-intensive HPC applications
• Data-intensive applications
• Data-intensive and compute-intensive HPC applications
• Compute-intensive grid, distributed, and cloud applications
• Data-oriented grid, distributed, and cloud applications]
More key resources: • Software • Human resources (workforce)
'80s, '90s +: Computational Science; '90s, '00s +: Informatics; '00s, '10s +: Data Science
In the beginning … The Branscomb Pyramid, circa 1993
Branscomb Pyramid provides a framework to associate computational power with community use.
Original Branscomb Committee Report (“From Desktop to TeraFlop”) at http://www.csci.psu.edu/docs/branscomb.txt
The Branscomb Pyramid, circa 2015
Tiers (from bottom to top):
• Small-scale devices and personal computers (MF, GF)
• Small-scale campus/commercial clusters (TF)
• Large-scale campus/commercial resources, center supercomputers (TF, PF)
• Leadership class (PF, EF)
Opportunities for innovation at all levels …
Prefixes: Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
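As a quick aid for reading the pyramid, the small sketch below (Python; the helper name and formatting are just for illustration) maps a FLOP/s figure onto the prefixes listed above.

```python
# Label a FLOP/s rate with the SI prefixes from the pyramid (illustrative helper).
PREFIXES = [(1e24, "yotta"), (1e21, "zetta"), (1e18, "exa"), (1e15, "peta"),
            (1e12, "tera"), (1e9, "giga"), (1e6, "mega"), (1e3, "kilo")]

def label_flops(flops):
    for scale, name in PREFIXES:
        if flops >= scale:
            return f"{flops / scale:.1f} {name}FLOP/s"
    return f"{flops:.1f} FLOP/s"

print(label_flops(1.2e15))   # '1.2 petaFLOP/s' -- leadership-class territory
print(label_flops(5.0e9))    # '5.0 gigaFLOP/s' -- a personal device
```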
Also in 1993: The Top500 List created to rank supercomputers
• TOP500 list ranks and details the 500 most powerful supercomputers in the world
• Most powerful = performance on the LinPack benchmark.
• Rankings provide invaluable statistics on supercomputer trends by country, vendor, sector, processor characteristics, etc.
• List compiled by Hans Meuer of the University of Mannheim, Jack Dongarra of the University of Tennessee, and Erich Strohmaier and Horst Simon of NERSC / LBNL. The list comes out in November and June each year.
http://top500.org/
What the Top500 List measures
Rmax and Rpeak values are in TFlops
• Computers are assessed based on their performance on the LINPACK benchmark – calculating the solution to a dense system of linear equations.
– Users may scale the size of the problem and optimize the software in order to achieve the best performance for a given machine
– The algorithm used must conform to LU factorization with partial pivoting (the operation count must be 2/3 n^3 + O(n^2) double-precision floating point operations)
• Rpeak values are calculated using the advertised clock rate of the CPU (theoretical performance)
• Rmax = maximal LINPACK performance achieved (actual performance)
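To make the Rmax idea concrete, here is a minimal sketch (Python with NumPy, not the official HPL benchmark) that times the solution of a dense linear system and converts the nominal LU operation count into a GFLOP/s figure; the problem size n is an arbitrary choice for illustration.

```python
# Minimal sketch of a LINPACK-style measurement: time a dense solve, then divide
# the nominal operation count (2/3 n^3 + 2 n^2) by the elapsed time.
import time
import numpy as np

n = 2000                          # illustrative size; the real benchmark tunes n per machine
A = np.random.rand(n, n)
b = np.random.rand(n)

start = time.perf_counter()
x = np.linalg.solve(A, b)         # LU factorization with partial pivoting under the hood
elapsed = time.perf_counter() - start

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
print(f"Approx. {flops / elapsed / 1e9:.1f} GFLOP/s achieved on this machine")
```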
Rensselaer CCI Blue Gene Q on current Top500 list (November 2014):
• 59th most powerful supercomputer in the world
• 20th most powerful Academic supercomputer in the world
• 4th most powerful Academic supercomputer in the US
Performance Development (Slide courtesy of Jack Dongarra)
[Chart: Top500 performance development, 1993-2011, on a logarithmic scale from 100 MFlop/s to 100 PFlop/s. Three series are plotted: SUM (total of all 500 systems), N=1 (the top system), and N=500 (the last system on the list). N=1 grows from 59.7 GFlop/s to 10.5 PFlop/s, N=500 from 400 MFlop/s to 51 TFlop/s, and SUM from 1.17 TFlop/s to 74 PFlop/s; the #500 system trails the #1 system by roughly 6-8 years. For comparison, Jack's laptop delivers ~12 GFlop/s and his iPad2 / iPhone 4s ~1.02 GFlop/s.]
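A rough back-of-the-envelope check of the chart, using the endpoint values noted above and assuming roughly exponential growth between 1993 and 2011 (a hypothetical calculation, not from the slide itself):

```python
# Estimate the implied performance doubling time and the #1 vs. #500 lag
# from the chart's endpoint values.
import math

years = 2011 - 1993
growth_n1 = 10.5e15 / 59.7e9                       # #1 system: 59.7 GFlop/s -> 10.5 PFlop/s
doubling_time = years * math.log(2) / math.log(growth_n1)
lag = math.log2(59.7e9 / 400e6) * doubling_time    # how long #500 trails #1 in 1993 terms
print(f"doubling time ~{doubling_time:.1f} yr, #500 trails #1 by ~{lag:.1f} yr")
# ~1.0 yr doubling, ~7 yr lag -- consistent with the '6-8 years' annotation
```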
Fast forward to 2015: coordinated and integrated research information infrastructure is needed on all axes – COMPUTE (more FLOPS), DATA (more BYTES), and NETWORK (more bandwidth) – to enable application innovation. Examples spanning these axes:
• Protein analysis and modeling of function and structures
• Storage and analysis of data from the CERN Large Hadron Collider
• Development of biofuels
• Cosmology
• SETI@home, MilkyWay@Home, BOINC
• Real-time disaster response
Compute / Data Applications
• Terashake Earthquake Simulation
• Enzo Cosmology Modeling
• Large Hadron Collider Data Analysis
Earthquake Simulation
Earthquake Simulation
Background:
• The Earth is constantly evolving through the movement of "plates"
• Using plate tectonics, the Earth's outer shell (lithosphere) is posited to consist of seven large and many smaller moving plates
• As the plates move, their boundaries collide, spread apart, or slide past one another, resulting in geological processes such as earthquakes and tsunamis, volcanoes, and the development of mountains, typically at plate boundaries
Why Earthquake Simulations are Important
Terrestrial earthquakes damage homes, buildings, bridges, highways
Tsunamis come from earthquakes in the ocean
• If we understand how earthquakes can happen, we can
– Predict which places might be hardest hit
– Reinforce bridges and buildings to increase safety
– Prepare police, fire fighters and doctors in high-risk areas to increase their effectiveness
• Information technologies drive more accurate earthquake simulation
The 1994 M 6.7 Northridge, California earthquake caused an estimated $20B in damage
Major earthquakes on the San Andreas Fault, 1680-present: 1906 (M 7.8), 1857 (M 7.8), 1680 (M 7.7), and … ?
What would be the impact of an earthquake on the lower San Andreas Fault?
Simulation decomposition strategy leverages parallel high performance computers
– Southern California partitioned into “cubes” then mapped onto processors of high performance computer
– Data choreography used to move data in and out of memory during processing
Builds on data and models from the Southern California Earthquake Center; the kinematic source (from Denali) focuses on Cajon Creek to Bombay Beach
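A minimal sketch of the decomposition idea (hypothetical code, not the actual TeraShake/SCEC software): partition a 3D grid into sub-cubes and assign one to each processor rank.

```python
# Block-decompose a 3D simulation domain into sub-cubes, one per processor rank.
import itertools

def decompose(nx, ny, nz, px, py, pz):
    """Split an nx x ny x nz grid across a px x py x pz processor layout."""
    assignment = {}
    for rank, (i, j, k) in enumerate(itertools.product(range(px), range(py), range(pz))):
        assignment[rank] = {
            "x": (i * nx // px, (i + 1) * nx // px),
            "y": (j * ny // py, (j + 1) * ny // py),
            "z": (k * nz // pz, (k + 1) * nz // pz),
        }
    return assignment

# e.g. a 3000 x 1500 x 400 grid spread over 240 processors arranged 10 x 8 x 3
blocks = decompose(3000, 1500, 400, 10, 8, 3)
print(blocks[0])   # sub-cube of grid indices owned by rank 0
```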
TeraShake Simulation
Simulation of a magnitude 7.7 earthquake on the lower (southern) San Andreas Fault
• Physics-based dynamic source model – simulation on a mesh of 1.8 billion cubes with spatial resolution of 200 m
• Simulated the first 3 minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 seconds each
• The TeraShake 1 and 2 simulations generated 45+ TB of data
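As a quick sanity check of the quoted mesh size, assuming the 600 x 300 x 80 km TeraShake domain listed later in this lecture:

```python
# The TeraShake domain meshed at 200 m spacing gives the 1.8 billion cubes cited above.
dx_km = 0.2                                          # 200 m grid spacing
nx = round(600 / dx_km)
ny = round(300 / dx_km)
nz = round(80 / dx_km)
print(nx, ny, nz, nx * ny * nz)                      # 3000 1500 400 -> 1,800,000,000 cubes
```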
Under the surface
Behind the Scenes: TeraShake Data Choreography
[Diagram: resources must support a complicated orchestration of computation and data movement]
• 47 TB of output data for 1.8 billion grid points
• Continuous I/O at 2 GB/sec
• 240 processors on SDSC DataStar for 5 days, with 1 TB of main memory
• "Fat nodes" of DataStar with 256 GB used for pre-processing and post-run visualization
• 10-20 TB of data archived per day
• Parallel file system plus "data parking" storage
• Finer-resolution simulations require even more resources; TeraShake scaled to run on petascale architectures
TeraShake and Data
• Data management:
– 10 terabytes moved per day during execution over 5 days
– Derived data products registered into the SCEC digital library (total SCEC library had 168 TB)
• Data post-processing:
– Movies of seismic wave propagation
– Seismogram formatting for interactive on-line analysis
– Derived data: velocity magnitude, displacement vector field, cumulative peak maps, statistics used in visualizations

TeraShake Resources
Computers and systems:
• 80,000 hours on IBM Power4 (DataStar)
• 256 GB-memory p690 used for testing, p655s used for the production run, TeraGrid used for porting
• 30 TB global parallel file system (GPFS)
• Run-time 100 MB/s data transfer from GPFS to SAM-QFS
• 27,000 hours of post-processing for high-resolution rendering
People:
• 20+ people for IT support
• 20+ people in domain research
Storage:
• SAM-QFS archival storage
• HPSS backup
• SRB collection with 1,000,000 files
TeraShake at Petascale – better prediction accuracy creates greater resource demands
Estimated figures for a simulated 240-second period, 100-hour run time:

                              TeraShake                  PetaShake
Domain                        600 x 300 x 80 km^3        800 x 400 x 100 km^3
Fault system interaction      no                         yes
Inner scale                   200 m                      25 m
Resolution of terrain grid    1.8 billion mesh points    2.0 trillion mesh points
Magnitude of earthquake       7.7                        8.1
Time steps                    20,000 (0.012 sec/step)    160,000 (0.0015 sec/step)
Surface data                  1.1 TB                     1.2 PB
Volume data                   43 TB                      4.9 PB

(Tera = 10^12, Peta = 10^15)
Information courtesy of the Southern California Earthquake Center
Application Evolution
• TeraShake → PetaSHA, PetaShake, CyberShake, etc. at SCEC
• Evolving applications improving resolution, models, simulation accuracy, scope of results, etc.
• PetaSHA foci:
– Create a hierarchy of simulations for the 10 most probable large (M>7) ruptures in southern California
– Validation of earthquake simulations using well-recorded regional events (M<=6.7) and assimilation of regional waveform data into community velocity models
– Validation of hazard curves and extension of maps to higher frequencies and more extensive geographic coverage, creating rich new database for earthquake scientists and engineers
Cosmology
Slide modified from Mike Norman
Cosmology Modeling – How did the universe evolve after the “Big Bang”?
• Astrophysicists consider this a key period that included:
– Tumultuous, intense star formation throughout the universe
– Synthesis of the first heavy elements in massive stars
– Supernovae, gamma-ray bursts, seed black holes, and the corresponding growth of supermassive black holes and the birth of quasars
– Assembly of the first galaxies
Evolving the Universe from the “Big Bang”
[Diagram: simulation outputs ("dumps" 1-5) from different epochs, ordered from "then" to "now". Composing simulation outputs from different timeframes builds up the light-cone volume (the light cone shows the set of prior events which could have caused the event under consideration).]
Slide modified from Mike Norman
ENZO Community Code
Enzo simulates the first billion years of cosmic evolution after the "Big Bang".
• Community code: Enzo 2.0 is the product of developments made at UC San Diego, Stanford, Princeton, Columbia, MSU, CU Boulder, CITA, McMaster, SMU, and UC Berkeley.
• Code and materials available at enzo.googlecode.com
What Enzo does:
• Calculates the growth of cosmic structure from seed perturbations to form stars, galaxies, and galaxy clusters, including simulation of
– Dark matter
– Ordinary matter (atoms)
– Self-gravity
– Cosmic expansion
[Image: formation of a galaxy cluster]
Slide modified from Mike Norman
ENZO general approach
• Enzo solves dark matter N-body dynamics using a particle mesh technique
– Poisson equation solved using a combination of FFT and multigrid techniques
– Euler's equations of hydrodynamics solved using a modified version of the piecewise parabolic method
– (Dark matter: nonluminous material that is postulated to exist in space and that could take any of several forms, including weakly interacting particles (cold dark matter) or high-energy randomly moving particles created soon after the Big Bang (hot dark matter).)
• Enzo employs adaptive mesh refinement (AMR) to achieve high spatial resolution in 3D
– The Santa Fe light cone simulation generated over 350,000 grids at 7 levels of refinement
– Effective resolution = 65,536^3
[Diagram: AMR grid hierarchy at Level 0, Level 1, Level 2]
Slide modified from Mike Norman
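The sketch below illustrates the AMR idea in one dimension (purely illustrative Python, not Enzo's implementation): cells are flagged for refinement only where the solution changes rapidly, so resolution is concentrated where it is needed.

```python
# Flag cells of a 1D grid for refinement wherever the density gradient exceeds a
# threshold, then add midpoints in flagged cells (one extra level of refinement).
import numpy as np

def flag_for_refinement(density, threshold):
    """Return indices of cells whose local gradient magnitude exceeds threshold."""
    grad = np.abs(np.gradient(density))
    return np.where(grad > threshold)[0]

def refine(x, flagged):
    """Insert midpoints after each flagged cell."""
    new_x = list(x)
    for i in sorted(flagged, reverse=True):
        if i + 1 < len(x):
            new_x.insert(i + 1, 0.5 * (x[i] + x[i + 1]))
    return np.array(new_x)

x = np.linspace(0.0, 1.0, 64)
density = 1.0 + np.exp(-((x - 0.5) ** 2) / 0.001)    # sharp overdensity at x = 0.5
flagged = flag_for_refinement(density, threshold=0.1)
finer_x = refine(x, flagged)
print(f"{len(flagged)} of {len(x)} cells flagged; grid grows to {len(finer_x)} points")
```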
Enzo data volumes grow with simulation size
• 2048^3 simulation (2008): 8 gigazones x 16 fields/zone x 4 bytes/field = 0.5 TB/output; x 100 outputs/run = 50 TB
• 4096^3 simulation (2009): 64 gigazones x 16 fields/zone x 4 bytes/field = 4 TB/output; x 50 outputs/run = 200 TB
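A quick check of the data-volume arithmetic above (Python, using decimal terabytes, so the figures come out slightly above the rounded values quoted on the slide):

```python
# Output size per dump and per run for an Enzo-style uniform-grid simulation.
def run_volume(zones, fields_per_zone=16, bytes_per_field=4, outputs=100):
    per_output = zones * fields_per_zone * bytes_per_field
    return per_output, per_output * outputs

per_out, total = run_volume(2048**3, outputs=100)
print(f"{per_out / 1e12:.2f} TB per output, {total / 1e12:.0f} TB per run")   # ~0.55 TB, ~55 TB
per_out, total = run_volume(4096**3, outputs=50)
print(f"{per_out / 1e12:.2f} TB per output, {total / 1e12:.0f} TB per run")   # ~4.4 TB, ~220 TB
```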
Application Evolution
Enzo → Enzo-P (petascale Enzo), Cello (scalable adaptive mesh refinement framework)
ENZO at Petascale
● Self-consistent radiation-hydro simulations of the structural, chemical, and radiative evolution of the universe, from the first stars to the first galaxies
● Sophisticated techniques to improve parallelism and scaling, reduce communication, adapt to emerging petascale programming and run-time environments, promote data locality and efficient I/O
Slide modified from Mike Norman
Many CS challenges:
● Scaling adaptive mesh refinement
● Developing task scheduling and data locality approaches synergistic with dynamic load balancing approach
● Transition from MPI to Charm++ parallelism
● Inline data analysis/viz. to reduce I/O
● Adaptation to next generation of programming environments and supercomputer architectures
Large Hadron Collider Data Analysis
Large Hadron Collider / CERN
Slide adapted from Jamie Shiers, CERN
Results
The Large Hadron Collider (LHC)
• LHC is the world’s most powerful particle collider.
– Allows physicists to test various theories of particle physics and high energy physics
– LHC experiments expected to address key unsolved problems in physics
• LHC contains seven detectors, each designed for a different kind of research.
• LHC collisions produce 10’s of PBs of data per year.
– Subset of data analyzed by a distributed grid of 170+ computing centers in 36 countries
A collider is a type of particle accelerator with two directed beams of particles. In particle physics, colliders are used as a research tool: they accelerate particles to very high kinetic energies and let them impact other particles. Analysis of the byproducts of these collisions gives scientists good evidence of the structure of the subatomic world and the laws of nature governing it. Many of these byproducts are produced only by high energy collisions, and they decay after very short periods of time. Thus many of them are hard or near impossible to study in other ways. (Wikipedia)
LHC data analysis
• LHC Computing Grid designed to handle LHC data.
– 6 PB of data from LHC proton-proton collisions analyzed by 2012
– Collision and simulation data produced estimated at roughly 15+ PB per year, “uninteresting” events not preserved or analyzed.
• LHC Computing Grid uses private fiber optic cables and high-speed portions of the Internet to transfer CERN data to academic institutions around the world for analysis.
– Tree of grid nodes in tiers. Tier 0 is CERN computer center.
– 11 Tier 1 institutions connected via LHC Optical Private Network and receive specific subsets of raw data and serve as backup data repository for CERN.
– More than 150 Tier 2 institutions connected to Tier 1 institutions. Tier 2 sites handle analysis requirements and a proportional share of simulated event production and reconstruction.
• Data stream from the detectors provides ~300 GB/s of data, which is filtered for "interesting events"; the resulting "raw" data stream is ~300 MB/s (~27 TB/day).
– Data stream also contains 10 TB/day of “event summary” data
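A quick arithmetic check of the quoted raw-data rate (a trivial calculation; the exact daily figure depends on duty cycle and rounding):

```python
# Convert a sustained ~300 MB/s stream into a daily volume in decimal terabytes.
rate_mb_per_s = 300
tb_per_day = rate_mb_per_s * 1e6 * 86400 / 1e12
print(f"{tb_per_day:.1f} TB/day")   # ~25.9 TB/day, i.e. roughly the ~27 TB/day quoted above
```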
Worldwide LHC Computing Grid
• Distributed infrastructure of 150 computing centers in 40 countries
• 300,000+ CPU cores (~2M HEP-SPEC-06)
• The biggest site has ~50k CPU cores; 12 T2 sites have 2-30k CPU cores
• Distributed data, services, and operations infrastructure
Slide adapted from Jamie Shiers, CERN
Data: Outlook for HL-LHC
• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…
At least 0.5 EB / year (x 10 years of data taking)
[Chart: estimated RAW data per year in PB (0-450) for ALICE, ATLAS, CMS, and LHCb across Run 1 through Run 4, with a "We are here!" marker at the current run.]
Slide adapted from Jamie Shiers, CERN
LHC – Stewardship and Preservation Challenges
• Significant volumes of high energy physics data are thrown away “at birth” – i.e. via very strict filters (aka triggers) before writing to storage. To a first approximation, all remaining data needs to be preserved for a few decades.
– LHC data particularly valuable as reproducibility of experiments is tremendously expensive and almost impossible to achieve
• Tier 0 and 1 sites currently provide bit preservation at scale
– Data more usable and accessible when services coupled with bit preservation
Slide adapted from Jamie Shiers, CERN
Post-collision
David South | Data Preservation and Long Term Analysis in HEP | CHEP 2012, May 21-25, 2012
After the collisions have stopped:
• Finish the analyses! But then what do you do with the data?
– Until recently, there was no clear policy on this in the HEP community
– It's possible that older HEP experiments have in fact simply lost the data
• Data preservation, including long-term access, is generally not part of the planning, software design, or budget of an experiment
– So far, HEP data preservation initiatives have mostly not been planned by the original collaborations, but rather have been the effort of a few knowledgeable people
• The conservation of tapes is not equivalent to data preservation!
– "We cannot ensure data is stored in file formats appropriate for long term preservation"
– "The software for exploiting the data is under the control of the experiments"
– "We are sure most of the data are not easily accessible!"
Cyberinfrastructure
"Cyberinfrastructure" (aka e-Infrastructure)
• Cyberinfrastructure is the organized aggregate of information technologies coordinated to address problems in science and society
• Cyberinfrastructure components:
– Digital data
– Computers
– Wireless and wireline networks
– Personal digital devices
– Scientific instruments
– Storage
– Software
– Sensors
– People …
"If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy."
NSF Final Report of the Blue Ribbon Advisory Panel on Cyberinfrastructure ("Atkins Report", 2003)
Cyberinfrastructure (CI): an emerging national focus in 2000 and beyond
• Publication of the Atkins report accelerated CI as a critical national focus within federal R&D investments, and especially at NSF
• CI was elevated from a division within the CISE directorate to an Office within NSF (Atkins became the first OCI Director)
• The San Diego Supercomputer Center (SDSC), a pioneer in data-intensive computing, focused on leadership in data cyberinfrastructure and data-enabled applications
Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf
Building Data Cyberinfrastructure at SDSC, 2001 - 2009
Data Cyberinfrastructure at the San Diego Supercomputer Center (SDSC), 2001-2009
SDSC in a Nutshell:
• 1985-2004: NSF supercomputer / cyberinfrastructure center hosted at UCSD. 2004-present: UCSD center with national impact.
• Multi-faceted facility focused on cyberinfrastructure-enabled applications, high performance computing, and data stewardship
– Several hundred research, technology, and production systems-focused staff
– Home to 100+ national research projects, allocated national machines, research data infrastructure
– Funded by NSF, Department of Energy, NIH, DHS, Library of Congress, National Archives and Records Administration, UC system, etc.
SDSC Strategic Focus in the 2000's: Support for data-oriented science from the small to the large scale
• Special needs at the extremes …
• Data cyberinfrastructure should support:
– Petabyte-sized collections
– 100-petabyte archive
– Collections which must be preserved 100 years or more
– Data-oriented simulation, analysis, and modeling at 10-100X university/research-lab-level capacities
– Professional data services, software, and curation beyond what is feasible in university, campus, and research lab facilities
(Dimensions highlighted on the slide: size, number, timeframe, capability, SW support.)
SDSC Data Cyberinfrastructure / Selected Projects
[Diagram: SDSC Data CI hub linked to selected project areas – data portals, data visualization, data management, HPC data, data analytics, data services, data storage, data preservation, and Data Oasis]
SDSC Data Central
• One of the first general-purpose programs of its kind to support research and community data collections and databases
• Data Central was available without charge to the scientific community and provided a facility to store, manage, analyze, mine, share and publish data collections, enabling access and collaboration in the broader scientific community
• Project led by Natasha Balac at SDSC
Who could apply
• Open to researchers affiliated with US educational institutions
• Proposals were merit-reviewed quarterly by Data Allocations Committee
Types of Allocations:
• Expedited Allocations
– 1 TB or less of disk & tape 1st year
– 5 GB Database 1st year
– Yearly review
• Medium Allocations
– Under 30 TB
• Large Allocations
– Larger than 30 TB
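For illustration only, the allocation tiers above could be expressed as a simple routing rule (hypothetical helper, not SDSC software):

```python
# Route a storage request to the DataCentral allocation category described above.
def allocation_category(size_tb):
    if size_tb <= 1:
        return "Expedited (1 TB or less of disk & tape, yearly review)"
    if size_tb < 30:
        return "Medium (under 30 TB)"
    return "Large (30 TB or larger)"

for request_tb in (0.5, 10, 45):
    print(request_tb, "TB ->", allocation_category(request_tb))
```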
DataCentral Allocated Collections included
Seismology: 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences: 50-year Downscaling of Global Analysis over California Region
Earth Sciences: NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics: AMANDA data
Biology: AfCS Molecule Pages
Biomedical Neuroscience: BIRN
Networking: Backbone Header Traces
Networking: Backscatter Data
Biology: Bee Behavior
Biology: Biocyc (SRI)
Art: C5 landscape Database
Geology: Chronos
Biology: CKAAPS
Biology: DigEmbryo
Earth Science Education: ERESE
Earth Sciences: UCI ESMF
Earth Sciences: EarthRef.org
Earth Sciences: ERDA
Earth Sciences: ERR
Biology: Encyclopedia of Life
Life Sciences: Protein Data Bank
Geosciences: GEON
Geosciences: GEON-LIDAR
Geochemistry: Kd
Biology: Gene Ontology
Geochemistry: GERM
Networking: HPWREN
Ecology: HyperLter
Networking: IMDC
Biology: Interpro Mirror
Biology: JCSG Data
Government: Library of Congress Data
Geophysics: Magnetics Information Consortium data
Education: UC Merced Japanese Art Collections
Geochemistry: NAVDAT
Earthquake Engineering: NEESIT data
Education: NSDL
Astronomy: NVO
Government: NARA
Anthropology: GAPP
Neurobiology: Salk data
Seismology: SCEC TeraShake
Seismology: SCEC CyberShake
Oceanography: SIO Explorer
Networking: Skitter
Astronomy: Sloan Digital Sky Survey
Geology: Sensitive Species Map Server
Geology: SD and Tijuana Watershed data
Oceanography: Seamount Catalogue
Oceanography: Seamounts Online
Biodiversity: WhyWhere
Ocean Sciences: Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering: TeraBridge
Various: TeraGrid data collections
Biology: Transporter Classification Database
Biology: TreeBase
Geoscience: Tsunami Data
Education: ArtStor
Biology: Yeast regulatory network
Biology: Apoptosis Database
Cosmology: LUSciD
Data Resources at SDSC
• SDSC data comes from instruments, computers, experiments, sensors, and other sources and is largely available to the community through the web
• SDSC data infrastructure strove to meet multiple needs:
– High performance and high reliability
– Generic support (DataCentral) and custom support (PDB, etc.)
– “Best effort” and preservation-level
– Broad range of data types, formats, sizes, sources, policies, etc.
Focus of Computing Procurements: Supercomputers that supported Data-Intensive Applications
• A balanced system to support data-oriented applications requires a trade-off of flops and other key system characteristics
• Balanced system provides support for tightly-coupled and strong I/O applications
– Grid platforms not a strong option
– Data local to computation
– I/O rates exceed WAN capabilities
– Continuous and frequent I/O is latency intolerant
• Scalability is key
– Need high-bandwidth and large-capacity local parallel file systems
– Need large-capacity flexible “parking” storage for post-processing
– Need high-bandwidth and large-scale archival storage
• Application performance determines the best configuration
[Chart: DoD applications and benchmark kernels (Linpack, STREAM, RandomAccess, Overflow; measured on IBM Pwr3) plotted by spatial locality (x-axis, 0.0-0.9) and temporal locality (y-axis, 0.0-1.0), with stride-0 data ignored. Compute-intensive codes fall toward the high-locality region and data-intensive codes toward the low-locality region.]
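To make the axes of the plot concrete, here is an illustrative sketch (hypothetical Python, not the methodology behind the DoD study) that scores a memory-address trace for spatial and temporal locality:

```python
# Score an address trace: spatial locality = fraction of accesses close to the
# previous address; temporal locality = fraction of accesses that reuse a recent address.
def locality_scores(trace, stride=8, window=32):
    spatial = sum(abs(b - a) <= stride for a, b in zip(trace, trace[1:])) / (len(trace) - 1)
    temporal = sum(addr in trace[max(0, i - window):i]
                   for i, addr in enumerate(trace) if i) / (len(trace) - 1)
    return spatial, temporal

stream_like = list(range(1000))                          # sequential sweep through memory
random_like = [(i * 7919) % 4096 for i in range(1000)]   # scattered accesses, no reuse
print(locality_scores(stream_like))   # high spatial, low temporal
print(locality_scores(random_like))   # low spatial, low temporal
```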
Data Activities at SDSC in 2015
• Data Oasis – 4 PB of storage for users of SDSC Gordon and Trestles
• Workflow for Data Science (WorDS) Center of Excellence to provide resource for developing and validating scientific workflows related to data curation, analysis, visualization, dissemination
• Data mining boot camps (PACE – Predictive Analytics Center of Excellence)
Lecture Materials (not already on slides)
• More information about Enzo-P: https://www.xsede.org/documents/271087/369160/ExtScale-Bordner.pdf
• Enzo data simulation on YouTube (set to music no less!): http://www.youtube.com/watch?v=N-nSh_-8_xk
• Southern California Earthquake Center, http://www.scec.org/
• Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf
• LHC, www.wikipedia.com
• Worldwide LHC Computing Grid website, http://wlcg-public.web.cern.ch/tier-centres
• San Diego Supercomputer Center, www.sdsc.edu
Break
Grading Detail – Section 2 Undergrad Paper
Specs
• Paper: 6-8 pages, 1.5 line spacing, 12 pt font
• PDF due to [email protected] by 8:45 a.m. on April 3.
• Focus of paper should be an area of science or society that has been transformed by the availability of digital data
• General outline:
– Description of the area and how data has transformed it
– Specifics on the kind of innovation in the application of data that has made this possible
– What kind of infrastructure is needed to make this possible
– Conclusion and thoughts about transformative potential of data in this area in the future
• Paper should include adequate references and bibliography (not included in the page count)
• Send a 1-2 page description of the topic of the paper and a detailed outline in .pdf to [email protected] by 12:00 a.m. on March 9.
UG Section 2 Paper Grading Metrics (20 points total)
Writing (10 points):
• Is the paper well-organized, readable by non-specialists, and credible to specialists?
• Is the writing articulate and compelling?
• Is the paper well-structured with the main points backed up by evidence?
Content (10 points):
• Does the paper content provide adequate depth and evidence to describe the transformation of an area by digital data?
• Are the references reasonable and adequate?
Grading Detail – Section 2 Grad mini-proposal
Specs
• Mini-Proposal: 10 pages, 1.5 line spacing, 12 pt font
• PDF due to [email protected] by 8:45 a.m. on April 3.
• Focus of mini-proposal is a research project in an area of science or society that has been transformed by the availability of digital data.
• Target solicitation: NSF Computational and Data-Enabled Science and Engineering (CDS&E) program (http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504813&org=CISE&sel_org=CISE&from=fund )
• Mini-proposal should include:
– 1 page proposal summary providing Description, Intellectual Merit, and Broader Impacts sections
– 9 page proposal description including:
• Introduction (what is the proposal about?)
• Related work (relevant to the proposed project)
• Project plan (plan / approach for accomplishing the work)
• Metrics of success
• Conclusion
• Mini-proposal should include adequate references and bibliography (not included in the page count)
• Send a 1-2 page description of the topic of the paper and a detailed outline in .pdf to [email protected] by 12:00 a.m. on March 9.
Grad Mini-proposal Grading Metrics (20 points total)
Content (10 points):
• Does the project have a clear focus and a research plan?
• Do the metrics of success adequately support the goals and approach?
• Is the project a departure from related work and are the references reasonable and adequate?
Writing (10 points):
• Does the proposal follow the guidelines provided? Is it well-organized, readable by non-specialists, and credible to specialists?
• Is the writing articulate and compelling?
• Is the proposal well structured?
Heilmeier’s Catechism (from Wikipedia)
George Heilmeier was a former Director of DARPA (the Defense Advanced Research Projects Agency), former CTO of Texas Instruments, former President of Bellcore, and former CEO of SAIC.
Heilmeier's Catechism is a set of questions credited to Heilmeier that anyone proposing a research project or product development effort should be able to answer:
• What are you trying to do? Articulate your objectives using absolutely no jargon.
• How is it done today, and what are the limits of current practice?
• What's new in your approach and why do you think it will be successful?
• Who cares?
• If you're successful, what difference will it make?
• What are the risks and the payoffs?
• How much will it cost?
• How long will it take?
• What are the midterm and final "exams" to check for success? (metrics of success)
Data Round Table
Next week: L3 Data Roundtable for February 20
• The 4th Paradigm – Data Intensive Scientific Discovery. 2009. “A 2020 Vision for Ocean Science” article http://research.microsoft.com/en-us/collaboration/fourthparadigm/ (Philip Cioni)
• “Making Research and Data Cyberinfrastructure Real”, Educause Review, July/August 2008, http://www.cs.rpi.edu/~bermaf/Educause%20CI%20FINAL%2008.pdf (Pranshu Saxena)
• “Star is crowded by Super-Earths”, BBC (June, 2013), http://www.bbc.co.uk/news/science-environment-23032467 (Dennis Fogerty)
• “Scientists Use Stargazing Technology in the Fight against Cancer,” Time (February, 2013) http://healthland.time.com/2013/02/27/scientists-use-stargazing-technology-in-the-fight-against-cancer/ (Ted Tenedorio)
• “Digital Keys for Unlocking the Humanities’ Riches”, the New York Times, http://www.nytimes.com/2010/11/17/arts/17digital.html?pagewanted=all&_r=0 (Juan Poma)
Today: Lecture 2 Data Roundtable
• “Got Data? A Guide to Digital Preservation in the Information Age,” CACM (December, 2008) http://www.cs.rpi.edu/~bermaf/CACM08.pdf (Miguel Lantigua-Inoa)
• “A Digital Life,” Scientific American (March, 2007) http://www.scientificamerican.com/article/a-digital-life/ (Alex Karcher)
• “Thirteen Ways of Looking at … Digital Preservation,” D-Lib Magazine (August, 2004), http://www.dlib.org/dlib/july04/lavoie/07lavoie.html (Yusri Jamaluddin)
• “The Lost NASA Tapes: Restoring Lunar Images after 40 Years in the Vault”, ComputerWorld (June, 2009), http://www.computerworld.com/article/2525935/computer-hardware/the-lost-nasa-tapes--restoring-lunar-images-after-40-years-in-the-vault.html?page=2 (Robert Stephens)
• Princeton Single-Pay Storage model (2 students):
– “DataSpace: A Funding and Operational Model for LongTerm Preservation and Sharing of Research Data” White Paper (August, 2010), http://dataspace.princeton.edu/jspui/bitstream/88435/dsp01w6634361k/1/DataSpaceFundingModel_20100827.pdf (Lars Olsson)
– “Paying for Long Term Storage”, DSHR’s Blog (February, 2011), http://blog.dshr.org/2011/02/paying-for-long-term-storage.html (Oskari Rautiainen)