TRANSCRIPT
Fran Berman, Data and Society, CSCI 4967/6963
Data and Society Lecture 3: Data and Computing
2/13/15
Announcements
• All students: send 2 sample essay exam questions on Lectures 1, 2, 3 material to [email protected] by February 18
• Next week's guest speaker is Colin Bodell, CTO of Time Inc. and former head of the Kindle technology team at Amazon
– Please attend class and be prepared to ask Colin many questions
Today (2/13/15)
• Any questions about Lecture 2?
• Lecture 3: Data and Computing
– Some history
– IT-enabled Applications
– Cyberinfrastructure
– Building Data Cyberinfrastructure at SDSC
• Break
• Data Roundtable (Miguel, Alex, Yusri, Robert, Lars, Oskari)
Course schedule (date: first "half" / second "half")

Section 1: The Data Ecosystem -- Fundamentals
• January 30: Class introduction; Digital data in the 21st Century (L1) / Data Roundtable (Fran)
• February 6: Data Stewardship and Preservation (L2) / L1 Data Roundtable (5 students)
• February 13: Data and Computing (L3) / L2 Data Roundtable (6 students)  [We are here]
• February 20: Colin Bodell, Time Inc. CTO Guest Lecture and Q&A / L3 Data Roundtable (5 students)

Section 2: Data and Innovation – How has data transformed science and society?
• February 27: Section 1 Exam / Data and the Health Sciences (L4)
• March 6: Paper preparation / no class
• March 13: Data and Entertainment (L5) / L4 Data Roundtable (6 students)
• March 20: Big Data Applications (L6) / L5 Data Roundtable (5 students)

Section 3: Data and Community – Social infrastructure for a data-driven world
• April 3: Data in the Global Landscape (L7); Section 2 paper due / L6 Data Roundtable (6 students)
• April 10: Bulent Yener Guest Lecture, Data Privacy / Bad Guys on the Internet (L8) / L7 Data Roundtable (5 students)
• April 17: Data and the Workforce (L9) / L8 Data Roundtable (6 students)
• April 24: Mike Schroepfer, Facebook CTO Guest Lecture and Q&A
• May 1: Data Futures (L10) / L9 Data Roundtable (5 students)
• May 8: Section 3 Exam / L10 Data Roundtable (5 students)
Lecture 3: Data and Computing
Computational modeling, simulation, and analysis are critical tools in addressing science and societal challenges, for example:
• What is the potential impact of global warming?
• How will natural disasters affect urban centers?
• What therapies can be used to cure or control cancer?
• Can we accurately predict market outcomes?
• What plants work best for biofuels?
• Is there life on other planets?
Computational science: an increasing focus in the '80s and '90s (data issues often in the background …)
• Many reports in 80’s and early 90’s focused on the potential of information technologies (primarily computers and high-speed networks) to address key scientific and societal challenges
• First federal “Blue Book” in 1992 focused on key computational problems including
– Weather forecasting
– Cancer genes
– Predicting new superconductors
– Aerospace vehicle design
– Air pollution
– Energy conservation and turbulent combustion
– Microsystems design and packaging
– Earth's biosphere
– Broader education resources
Enabling IT: Increasing focus on a broader spectrum of resources
[Diagram: application classes arranged along three axes of enabling IT – COMPUTE (more FLOPS), DATA (more BYTES), and NETWORK (more bandwidth):
• Home, lab, campus, desktop applications
• Compute-intensive HPC applications
• Data-intensive applications
• Data-intensive and compute-intensive HPC applications
• Compute-intensive grid, distributed, and cloud applications
• Data-oriented grid, distributed, and cloud applications]
More key resources: • Software • Human resources (workforce)
'80s, '90s +: Computational Science; '90s, '00s +: Informatics; '00s, '10s +: Data Science
In the beginning … The Branscomb Pyramid, circa 1993
Branscomb Pyramid provides a framework to associate computational power with community use.
Original Branscomb Committee Report (“From Desktop to TeraFlop”) at http://www.csci.psu.edu/docs/branscomb.txt
The Branscomb Pyramid, circa 2015
Tiers (from bottom to top):
• Small-scale devices and personal computers (MF, GF)
• Small-scale campus/commercial clusters (TF)
• Large-scale campus/commercial resources, center supercomputers (TF, PF)
• Leadership class (PF, EF)
Opportunities for innovation at all levels …
Prefixes: Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
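As a quick aid for reading the pyramid, the small sketch below (Python; the helper name and formatting are just for illustration) maps a FLOP/s figure onto the prefixes listed above.

```python
# Label a FLOP/s rate with the SI prefixes from the pyramid (illustrative helper).
PREFIXES = [(1e24, "yotta"), (1e21, "zetta"), (1e18, "exa"), (1e15, "peta"),
            (1e12, "tera"), (1e9, "giga"), (1e6, "mega"), (1e3, "kilo")]

def label_flops(flops):
    for scale, name in PREFIXES:
        if flops >= scale:
            return f"{flops / scale:.1f} {name}FLOP/s"
    return f"{flops:.1f} FLOP/s"

print(label_flops(1.2e15))   # '1.2 petaFLOP/s' -- leadership-class territory
print(label_flops(5.0e9))    # '5.0 gigaFLOP/s' -- a personal device
```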
Also in 1993: The Top500 List created to rank supercomputers
• TOP500 list ranks and details the 500 most powerful supercomputers in the world
• Most powerful = performance on the LinPack benchmark.
• Rankings provide invaluable statistics on supercomputer trends by country, vendor, sector, processor characteristics, etc.
• List compiled by Hans Meuer of the University of Mannheim, Jack Dongarra of the University of Tennessee, and Erich Strohmaier and Horst Simon of NERSC / LBNL. The list comes out in November and June each year.
http://top500.org/
What the Top500 List measures
Rmax and Rpeak values are in TFlops
• Computers are assessed based on their performance on the LINPACK benchmark – calculating the solution to a dense system of linear equations.
– Users may scale the size of the problem and optimize the software in order to achieve the best performance for a given machine
– The algorithm used must conform to LU factorization with partial pivoting (the operation count must be 2/3 n^3 + O(n^2) double-precision floating point operations)
• Rpeak values are calculated using the advertised clock rate of the CPU (theoretical performance)
• Rmax = maximal LINPACK performance achieved (actual performance)
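To make the Rmax idea concrete, here is a minimal sketch (Python with NumPy, not the official HPL benchmark) that times the solution of a dense linear system and converts the nominal LU operation count into a GFLOP/s figure; the problem size n is an arbitrary choice for illustration.

```python
# Minimal sketch of a LINPACK-style measurement: time a dense solve, then divide
# the nominal operation count (2/3 n^3 + 2 n^2) by the elapsed time.
import time
import numpy as np

n = 2000                          # illustrative size; the real benchmark tunes n per machine
A = np.random.rand(n, n)
b = np.random.rand(n)

start = time.perf_counter()
x = np.linalg.solve(A, b)         # LU factorization with partial pivoting under the hood
elapsed = time.perf_counter() - start

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
print(f"Approx. {flops / elapsed / 1e9:.1f} GFLOP/s achieved on this machine")
```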
Rensselaer CCI Blue Gene Q on current Top500 list (November 2014):
• 59th most powerful supercomputer in the world
• 20th most powerful Academic supercomputer in the world
• 4th most powerful Academic supercomputer in the US
Performance Development (Slide courtesy of Jack Dongarra)
[Chart: Top500 performance development, 1993-2011, on a logarithmic scale from 100 MFlop/s to 100 PFlop/s. Three series are plotted: SUM (total of all 500 systems), N=1 (the top system), and N=500 (the last system on the list). N=1 grows from 59.7 GFlop/s to 10.5 PFlop/s, N=500 from 400 MFlop/s to 51 TFlop/s, and SUM from 1.17 TFlop/s to 74 PFlop/s; the #500 system trails the #1 system by roughly 6-8 years. For comparison, Jack's laptop delivers ~12 GFlop/s and his iPad2 / iPhone 4s ~1.02 GFlop/s.]
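A rough back-of-the-envelope check of the chart, using the endpoint values noted above and assuming roughly exponential growth between 1993 and 2011 (a hypothetical calculation, not from the slide itself):

```python
# Estimate the implied performance doubling time and the #1 vs. #500 lag
# from the chart's endpoint values.
import math

years = 2011 - 1993
growth_n1 = 10.5e15 / 59.7e9                       # #1 system: 59.7 GFlop/s -> 10.5 PFlop/s
doubling_time = years * math.log(2) / math.log(growth_n1)
lag = math.log2(59.7e9 / 400e6) * doubling_time    # how long #500 trails #1 in 1993 terms
print(f"doubling time ~{doubling_time:.1f} yr, #500 trails #1 by ~{lag:.1f} yr")
# ~1.0 yr doubling, ~7 yr lag -- consistent with the '6-8 years' annotation
```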
Fast forward to 2015: coordinated and integrated research information infrastructure is needed on all axes – COMPUTE (more FLOPS), DATA (more BYTES), and NETWORK (more bandwidth) – to enable application innovation. Examples spanning these axes:
• Protein analysis and modeling of function and structures
• Storage and analysis of data from the CERN Large Hadron Collider
• Development of biofuels
• Cosmology
• SETI@home, MilkyWay@Home, BOINC
• Real-time disaster response
Compute / Data Applications
• Terashake Earthquake Simulation
• Enzo Cosmology Modeling
• Large Hadron Collider Data Analysis
Earthquake Simulation
Earthquake Simulation
Background:
• The Earth is constantly evolving through the movement of "plates"
• Using plate tectonics, the Earth's outer shell (lithosphere) is posited to consist of seven large and many smaller moving plates
• As the plates move, their boundaries collide, spread apart, or slide past one another, resulting in geological processes such as earthquakes and tsunamis, volcanoes, and the development of mountains, typically at plate boundaries
Why Earthquake Simulations are Important
Terrestrial earthquakes damage homes, buildings, bridges, highways
Tsunamis come from earthquakes in the ocean
• If we understand how earthquakes can happen, we can
– Predict which places might be hardest hit
– Reinforce bridges and buildings to increase safety
– Prepare police, fire fighters and doctors in high-risk areas to increase their effectiveness
• Information technologies drive more accurate earthquake simulation
The 1994 M 6.7 Northridge, California earthquake caused an estimated $20B in damage
Major earthquakes on the San Andreas Fault, 1680-present: 1906 (M 7.8), 1857 (M 7.8), 1680 (M 7.7), and … ?
What would be the impact of an earthquake on the lower San Andreas Fault?
Simulation decomposition strategy leverages parallel high performance computers
– Southern California partitioned into “cubes” then mapped onto processors of high performance computer
– Data choreography used to move data in and out of memory during processing
Builds on data and models from the Southern California Earthquake Center; the kinematic source (from Denali) focuses on Cajon Creek to Bombay Beach
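A minimal sketch of the decomposition idea (hypothetical code, not the actual TeraShake/SCEC software): partition a 3D grid into sub-cubes and assign one to each processor rank.

```python
# Block-decompose a 3D simulation domain into sub-cubes, one per processor rank.
import itertools

def decompose(nx, ny, nz, px, py, pz):
    """Split an nx x ny x nz grid across a px x py x pz processor layout."""
    assignment = {}
    for rank, (i, j, k) in enumerate(itertools.product(range(px), range(py), range(pz))):
        assignment[rank] = {
            "x": (i * nx // px, (i + 1) * nx // px),
            "y": (j * ny // py, (j + 1) * ny // py),
            "z": (k * nz // pz, (k + 1) * nz // pz),
        }
    return assignment

# e.g. a 3000 x 1500 x 400 grid spread over 240 processors arranged 10 x 8 x 3
blocks = decompose(3000, 1500, 400, 10, 8, 3)
print(blocks[0])   # sub-cube of grid indices owned by rank 0
```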
TeraShake Simulation
Simulation of a magnitude 7.7 earthquake on the lower (southern) San Andreas Fault
• Physics-based dynamic source model – simulation on a mesh of 1.8 billion cubes with spatial resolution of 200 m
• Simulated the first 3 minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 seconds each
• The TeraShake 1 and 2 simulations generated 45+ TB of data
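As a quick sanity check of the quoted mesh size, assuming the 600 x 300 x 80 km TeraShake domain listed later in this lecture:

```python
# The TeraShake domain meshed at 200 m spacing gives the 1.8 billion cubes cited above.
dx_km = 0.2                                          # 200 m grid spacing
nx = round(600 / dx_km)
ny = round(300 / dx_km)
nz = round(80 / dx_km)
print(nx, ny, nz, nx * ny * nz)                      # 3000 1500 400 -> 1,800,000,000 cubes
```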
Under the surface
Behind the Scenes: TeraShake Data Choreography
[Diagram: resources must support a complicated orchestration of computation and data movement]
• 47 TB of output data for 1.8 billion grid points
• Continuous I/O at 2 GB/sec
• 240 processors on SDSC DataStar for 5 days, with 1 TB of main memory
• "Fat nodes" of DataStar with 256 GB used for pre-processing and post-run visualization
• 10-20 TB of data archived per day
• Parallel file system plus "data parking" storage
• Finer-resolution simulations require even more resources; TeraShake scaled to run on petascale architectures
TeraShake and Data
• Data management:
– 10 terabytes moved per day during execution over 5 days
– Derived data products registered into the SCEC digital library (total SCEC library had 168 TB)
• Data post-processing:
– Movies of seismic wave propagation
– Seismogram formatting for interactive on-line analysis
– Derived data: velocity magnitude, displacement vector field, cumulative peak maps, statistics used in visualizations

TeraShake Resources
Computers and systems:
• 80,000 hours on IBM Power4 (DataStar)
• 256 GB-memory p690 used for testing, p655s used for the production run, TeraGrid used for porting
• 30 TB global parallel file system (GPFS)
• Run-time 100 MB/s data transfer from GPFS to SAM-QFS
• 27,000 hours of post-processing for high-resolution rendering
People:
• 20+ people for IT support
• 20+ people in domain research
Storage:
• SAM-QFS archival storage
• HPSS backup
• SRB collection with 1,000,000 files
TeraShake at Petascale – better prediction accuracy creates greater resource demands
Estimated figures for a simulated 240-second period, 100-hour run time:

                              TeraShake                  PetaShake
Domain                        600 x 300 x 80 km^3        800 x 400 x 100 km^3
Fault system interaction      no                         yes
Inner scale                   200 m                      25 m
Resolution of terrain grid    1.8 billion mesh points    2.0 trillion mesh points
Magnitude of earthquake       7.7                        8.1
Time steps                    20,000 (0.012 sec/step)    160,000 (0.0015 sec/step)
Surface data                  1.1 TB                     1.2 PB
Volume data                   43 TB                      4.9 PB

(Tera = 10^12, Peta = 10^15)
Information courtesy of the Southern California Earthquake Center
Application Evolution
• TeraShake → PetaSHA, PetaShake, CyberShake, etc. at SCEC
• Evolving applications improving resolution, models, simulation accuracy, scope of results, etc.
• PetaSHA foci:
– Create a hierarchy of simulations for the 10 most probable large (M>7) ruptures in southern California
– Validation of earthquake simulations using well-recorded regional events (M<=6.7) and assimilation of regional waveform data into community velocity models
– Validation of hazard curves and extension of maps to higher frequencies and more extensive geographic coverage, creating rich new database for earthquake scientists and engineers
Cosmology
Slide modified from Mike Norman
Cosmology Modeling – How did the universe evolve after the “Big Bang”?
• Astrophysicists consider this a key period that included:
– Tumultuous, intense star formation throughout the universe
– Synthesis of the first heavy elements in massive stars
– Supernovae, gamma-ray bursts, seed black holes, and the corresponding growth of supermassive black holes and the birth of quasars
– Assembly of the first galaxies
Evolving the Universe from the “Big Bang”
[Diagram: simulation outputs ("dumps" 1-5) from different epochs, ordered from "then" to "now". Composing simulation outputs from different timeframes builds up the light-cone volume (the light cone shows the set of prior events which could have caused the event under consideration).]
Slide modified from Mike Norman
ENZO Community Code
Enzo simulates the first billion years of cosmic evolution after the "Big Bang".
• Community code: Enzo 2.0 is the product of developments made at UC San Diego, Stanford, Princeton, Columbia, MSU, CU Boulder, CITA, McMaster, SMU, and UC Berkeley.
• Code and materials available at enzo.googlecode.com
What Enzo does:
• Calculates the growth of cosmic structure from seed perturbations to form stars, galaxies, and galaxy clusters, including simulation of
– Dark matter
– Ordinary matter (atoms)
– Self-gravity
– Cosmic expansion
[Image: formation of a galaxy cluster]
Slide modified from Mike Norman
ENZO general approach
• Enzo solves dark matter N-body dynamics using a particle mesh technique
– Poisson equation solved using a combination of FFT and multigrid techniques
– Euler's equations of hydrodynamics solved using a modified version of the piecewise parabolic method
– (Dark matter: nonluminous material that is postulated to exist in space and that could take any of several forms, including weakly interacting particles (cold dark matter) or high-energy randomly moving particles created soon after the Big Bang (hot dark matter).)
• Enzo employs adaptive mesh refinement (AMR) to achieve high spatial resolution in 3D
– The Santa Fe light cone simulation generated over 350,000 grids at 7 levels of refinement
– Effective resolution = 65,536^3
[Diagram: AMR grid hierarchy at Level 0, Level 1, Level 2]
Slide modified from Mike Norman
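The sketch below illustrates the AMR idea in one dimension (purely illustrative Python, not Enzo's implementation): cells are flagged for refinement only where the solution changes rapidly, so resolution is concentrated where it is needed.

```python
# Flag cells of a 1D grid for refinement wherever the density gradient exceeds a
# threshold, then add midpoints in flagged cells (one extra level of refinement).
import numpy as np

def flag_for_refinement(density, threshold):
    """Return indices of cells whose local gradient magnitude exceeds threshold."""
    grad = np.abs(np.gradient(density))
    return np.where(grad > threshold)[0]

def refine(x, flagged):
    """Insert midpoints after each flagged cell."""
    new_x = list(x)
    for i in sorted(flagged, reverse=True):
        if i + 1 < len(x):
            new_x.insert(i + 1, 0.5 * (x[i] + x[i + 1]))
    return np.array(new_x)

x = np.linspace(0.0, 1.0, 64)
density = 1.0 + np.exp(-((x - 0.5) ** 2) / 0.001)    # sharp overdensity at x = 0.5
flagged = flag_for_refinement(density, threshold=0.1)
finer_x = refine(x, flagged)
print(f"{len(flagged)} of {len(x)} cells flagged; grid grows to {len(finer_x)} points")
```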
Enzo data volumes grow with simulation size
• 2048^3 simulation (2008): 8 gigazones x 16 fields/zone x 4 bytes/field = 0.5 TB/output; x 100 outputs/run = 50 TB
• 4096^3 simulation (2009): 64 gigazones x 16 fields/zone x 4 bytes/field = 4 TB/output; x 50 outputs/run = 200 TB
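A quick check of the data-volume arithmetic above (Python, using decimal terabytes, so the figures come out slightly above the rounded values quoted on the slide):

```python
# Output size per dump and per run for an Enzo-style uniform-grid simulation.
def run_volume(zones, fields_per_zone=16, bytes_per_field=4, outputs=100):
    per_output = zones * fields_per_zone * bytes_per_field
    return per_output, per_output * outputs

per_out, total = run_volume(2048**3, outputs=100)
print(f"{per_out / 1e12:.2f} TB per output, {total / 1e12:.0f} TB per run")   # ~0.55 TB, ~55 TB
per_out, total = run_volume(4096**3, outputs=50)
print(f"{per_out / 1e12:.2f} TB per output, {total / 1e12:.0f} TB per run")   # ~4.4 TB, ~220 TB
```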
Application Evolution
Enzo → Enzo-P (petascale Enzo), Cello (scalable adaptive mesh refinement framework)
ENZO at Petascale
● Self-consistent radiation-hydro simulations of the structural, chemical, and radiative evolution of the universe, from the first stars to the first galaxies
● Sophisticated techniques to improve parallelism and scaling, reduce communication, adapt to emerging petascale programming and run-time environments, promote data locality and efficient I/O
Slide modified from Mike Norman
Many CS challenges:
● Scaling adaptive mesh refinement
● Developing task scheduling and data locality approaches synergistic with dynamic load balancing approach
● Transition from MPI to Charm++ parallelism
● Inline data analysis/viz. to reduce I/O
● Adaptation to next generation of programming environments and supercomputer architectures
Large Hadron Collider Data Analysis
Large Hadron Collider / CERN
Slide adapted from Jamie Shiers, CERN
Results
The Large Hadron Collider (LHC)
• LHC is the world’s most powerful particle collider.
– Allows physicists to test various theories of particle physics and high energy physics
– LHC experiments expected to address key unsolved problems in physics
• LHC contains seven detectors, each designed for a different kind of research.
• LHC collisions produce 10’s of PBs of data per year.
– Subset of data analyzed by a distributed grid of 170+ computing centers in 36 countries
A collider is a type of particle accelerator with two directed beams of particles. In particle physics, colliders are used as a research tool: they accelerate particles to very high kinetic energies and let them impact other particles. Analysis of the byproducts of these collisions gives scientists good evidence of the structure of the subatomic world and the laws of nature governing it. Many of these byproducts are produced only by high energy collisions, and they decay after very short periods of time. Thus many of them are hard or near impossible to study in other ways. (Wikipedia)
LHC data analysis
• LHC Computing Grid designed to handle LHC data.
– 6 PB of data from LHC proton-proton collisions analyzed by 2012
– Collision and simulation data produced estimated at roughly 15+ PB per year, “uninteresting” events not preserved or analyzed.
• LHC Computing Grid uses private fiber optic cables and high-speed portions of the Internet to transfer CERN data to academic institutions around the world for analysis.
– Tree of grid nodes in tiers. Tier 0 is CERN computer center.
– 11 Tier 1 institutions connected via LHC Optical Private Network and receive specific subsets of raw data and serve as backup data repository for CERN.
– More than 150 Tier 2 institutions connected to Tier 1 institutions. Tier 2 sites handle analysis requirements and a proportional share of simulated event production and reconstruction.
• Data stream from the detectors provides ~300 GB/s of data, which is filtered for "interesting events"; the resulting "raw" data stream is ~300 MB/s (~27 TB/day).
– Data stream also contains 10 TB/day of “event summary” data
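A quick arithmetic check of the quoted raw-data rate (a trivial calculation; the exact daily figure depends on duty cycle and rounding):

```python
# Convert a sustained ~300 MB/s stream into a daily volume in decimal terabytes.
rate_mb_per_s = 300
tb_per_day = rate_mb_per_s * 1e6 * 86400 / 1e12
print(f"{tb_per_day:.1f} TB/day")   # ~25.9 TB/day, i.e. roughly the ~27 TB/day quoted above
```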
Worldwide LHC Computing Grid
• Distributed infrastructure of 150 computing centers in 40 countries
• 300,000+ CPU cores (~2M HEP-SPEC-06)
• The biggest site has ~50k CPU cores; 12 T2 sites have 2-30k CPU cores
• Distributed data, services, and operations infrastructure
Slide adapted from Jamie Shiers, CERN
Data: Outlook for HL-LHC
• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…
At least 0.5 EB / year (x 10 years of data taking)
[Chart: estimated RAW data per year in PB (0-450) for ALICE, ATLAS, CMS, and LHCb across Run 1 through Run 4, with a "We are here!" marker at the current run.]
Slide adapted from Jamie Shiers, CERN
LHC – Stewardship and Preservation Challenges
• Significant volumes of high energy physics data are thrown away “at birth” – i.e. via very strict filters (aka triggers) before writing to storage. To a first approximation, all remaining data needs to be preserved for a few decades.
– LHC data particularly valuable as reproducibility of experiments is tremendously expensive and almost impossible to achieve
• Tier 0 and 1 sites currently provide bit preservation at scale
– Data more usable and accessible when services coupled with bit preservation
Slide adapted from Jamie Shiers, CERN
Post-collision
David South | Data Preservation and Long Term Analysis in HEP | CHEP 2012, May 21-25, 2012
After the collisions have stopped:
• Finish the analyses! But then what do you do with the data?
– Until recently, there was no clear policy on this in the HEP community
– It's possible that older HEP experiments have in fact simply lost the data
• Data preservation, including long-term access, is generally not part of the planning, software design, or budget of an experiment
– So far, HEP data preservation initiatives have mostly not been planned by the original collaborations, but rather have been the effort of a few knowledgeable people
• The conservation of tapes is not equivalent to data preservation!
– "We cannot ensure data is stored in file formats appropriate for long term preservation"
– "The software for exploiting the data is under the control of the experiments"
– "We are sure most of the data are not easily accessible!"
Cyberinfrastructure
"Cyberinfrastructure" (aka e-Infrastructure)
• Cyberinfrastructure is the organized aggregate of information technologies coordinated to address problems in science and society
• Cyberinfrastructure components:
– Digital data
– Computers
– Wireless and wireline networks
– Personal digital devices
– Scientific instruments
– Storage
– Software
– Sensors
– People …
"If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy."
NSF Final Report of the Blue Ribbon Advisory Panel on Cyberinfrastructure ("Atkins Report", 2003)
Cyberinfrastructure (CI): an emerging national focus in 2000 and beyond
• Publication of the Atkins report accelerated CI as a critical national focus within federal R&D investments, and especially at NSF
• CI was elevated from a division within the CISE directorate to an Office within NSF (Atkins became the first OCI Director)
• The San Diego Supercomputer Center (SDSC), a pioneer in data-intensive computing, focused on leadership in data cyberinfrastructure and data-enabled applications
Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf
Building Data Cyberinfrastructure at SDSC, 2001 - 2009
Data Cyberinfrastructure at the San Diego Supercomputer Center (SDSC), 2001-2009
SDSC in a Nutshell:
• 1985-2004: NSF supercomputer / cyberinfrastructure center hosted at UCSD. 2004-present: UCSD center with national impact.
• Multi-faceted facility focused on cyberinfrastructure-enabled applications, high performance computing, and data stewardship
– Several hundred research, technology, and production systems-focused staff
– Home to 100+ national research projects, allocated national machines, research data infrastructure
– Funded by NSF, Department of Energy, NIH, DHS, Library of Congress, National Archives and Records Administration, UC system, etc.
SDSC Strategic Focus in the 2000's: Support for data-oriented science from the small to the large scale
• Special needs at the extremes …
• Data cyberinfrastructure should support:
– Petabyte-sized collections
– 100-petabyte archive
– Collections which must be preserved 100 years or more
– Data-oriented simulation, analysis, and modeling at 10-100X university/research-lab-level capacities
– Professional data services, software, and curation beyond what is feasible in university, campus, and research lab facilities
(Dimensions highlighted on the slide: size, number, timeframe, capability, SW support.)
SDSC Data Cyberinfrastructure / Selected Projects
[Diagram: SDSC Data CI hub linked to selected project areas – data portals, data visualization, data management, HPC data, data analytics, data services, data storage, data preservation, and Data Oasis]
SDSC Data Central
• One of the first general-purpose programs of its kind to support research and community data collections and databases
• Data Central was available without charge to the scientific community and provided a facility to store, manage, analyze, mine, share and publish data collections, enabling access and collaboration in the broader scientific community
• Project led by Natasha Balac at SDSC
Who could apply
• Open to researchers affiliated with US educational institutions
• Proposals were merit-reviewed quarterly by Data Allocations Committee
Types of Allocations:
• Expedited Allocations
– 1 TB or less of disk & tape 1st year
– 5 GB Database 1st year
– Yearly review
• Medium Allocations
– Under 30 TB
• Large Allocations
– Larger than 30 TB
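For illustration only, the allocation tiers above could be expressed as a simple routing rule (hypothetical helper, not SDSC software):

```python
# Route a storage request to the DataCentral allocation category described above.
def allocation_category(size_tb):
    if size_tb <= 1:
        return "Expedited (1 TB or less of disk & tape, yearly review)"
    if size_tb < 30:
        return "Medium (under 30 TB)"
    return "Large (30 TB or larger)"

for request_tb in (0.5, 10, 45):
    print(request_tb, "TB ->", allocation_category(request_tb))
```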
DataCentral Allocated Collections included
Seismology: 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences: 50-year Downscaling of Global Analysis over California Region
Earth Sciences: NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics: AMANDA data
Biology: AfCS Molecule Pages
Biomedical Neuroscience: BIRN
Networking: Backbone Header Traces
Networking: Backscatter Data
Biology: Bee Behavior
Biology: Biocyc (SRI)
Art: C5 landscape Database
Geology: Chronos
Biology: CKAAPS
Biology: DigEmbryo
Earth Science Education: ERESE
Earth Sciences: UCI ESMF
Earth Sciences: EarthRef.org
Earth Sciences: ERDA
Earth Sciences: ERR
Biology: Encyclopedia of Life
Life Sciences: Protein Data Bank
Geosciences: GEON
Geosciences: GEON-LIDAR
Geochemistry: Kd
Biology: Gene Ontology
Geochemistry: GERM
Networking: HPWREN
Ecology: HyperLter
Networking: IMDC
Biology: Interpro Mirror
Biology: JCSG Data
Government: Library of Congress Data
Geophysics: Magnetics Information Consortium data
Education: UC Merced Japanese Art Collections
Geochemistry: NAVDAT
Earthquake Engineering: NEESIT data
Education: NSDL
Astronomy: NVO
Government: NARA
Anthropology: GAPP
Neurobiology: Salk data
Seismology: SCEC TeraShake
Seismology: SCEC CyberShake
Oceanography: SIO Explorer
Networking: Skitter
Astronomy: Sloan Digital Sky Survey
Geology: Sensitive Species Map Server
Geology: SD and Tijuana Watershed data
Oceanography: Seamount Catalogue
Oceanography: Seamounts Online
Biodiversity: WhyWhere
Ocean Sciences: Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering: TeraBridge
Various: TeraGrid data collections
Biology: Transporter Classification Database
Biology: TreeBase
Geoscience: Tsunami Data
Education: ArtStor
Biology: Yeast regulatory network
Biology: Apoptosis Database
Cosmology: LUSciD
Data Resources at SDSC
• SDSC data comes from instruments, computers, experiments, sensors, and other sources and is largely available to the community through the web
• SDSC data infrastructure strove to meet multiple needs:
– High performance and high reliability
– Generic support (DataCentral) and custom support (PDB, etc.)
– “Best effort” and preservation-level
– Broad range of data types, formats, sizes, sources, policies, etc.
Focus of Computing Procurements: Supercomputers that supported Data-Intensive Applications
• A balanced system to support data-oriented applications requires a trade-off of flops and other key system characteristics
• Balanced system provides support for tightly-coupled and strong I/O applications
– Grid platforms not a strong option
– Data local to computation
– I/O rates exceed WAN capabilities
– Continuous and frequent I/O is latency intolerant
• Scalability is key
– Need high-bandwidth and large-capacity local parallel file systems
– Need large-capacity flexible “parking” storage for post-processing
– Need high-bandwidth and large-scale archival storage
• Application performance determines the best configuration
[Chart: DoD applications and benchmark kernels (Linpack, STREAM, RandomAccess, Overflow; measured on IBM Pwr3) plotted by spatial locality (x-axis, 0.0-0.9) and temporal locality (y-axis, 0.0-1.0), with stride-0 data ignored. Compute-intensive codes fall toward the high-locality region and data-intensive codes toward the low-locality region.]
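To make the axes of the plot concrete, here is an illustrative sketch (hypothetical Python, not the methodology behind the DoD study) that scores a memory-address trace for spatial and temporal locality:

```python
# Score an address trace: spatial locality = fraction of accesses close to the
# previous address; temporal locality = fraction of accesses that reuse a recent address.
def locality_scores(trace, stride=8, window=32):
    spatial = sum(abs(b - a) <= stride for a, b in zip(trace, trace[1:])) / (len(trace) - 1)
    temporal = sum(addr in trace[max(0, i - window):i]
                   for i, addr in enumerate(trace) if i) / (len(trace) - 1)
    return spatial, temporal

stream_like = list(range(1000))                          # sequential sweep through memory
random_like = [(i * 7919) % 4096 for i in range(1000)]   # scattered accesses, no reuse
print(locality_scores(stream_like))   # high spatial, low temporal
print(locality_scores(random_like))   # low spatial, low temporal
```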
Data Activities at SDSC in 2015
• Data Oasis – 4 PB of storage for users of SDSC Gordon and Trestles
• Workflow for Data Science (WorDS) Center of Excellence to provide resource for developing and validating scientific workflows related to data curation, analysis, visualization, dissemination
• Data mining boot camps (PACE – Predictive Analytics Center of Excellence)
Lecture Materials (not already on slides)
• More information about Enzo-P: https://www.xsede.org/documents/271087/369160/ExtScale-Bordner.pdf
• Enzo data simulation on YouTube (set to music no less!): http://www.youtube.com/watch?v=N-nSh_-8_xk
• Southern California Earthquake Center, http://www.scec.org/
• Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf
• LHC, www.wikipedia.com
• Worldwide LHC Computing Grid website, http://wlcg-public.web.cern.ch/tier-centres
• San Diego Supercomputer Center, www.sdsc.edu
Break
Grading Detail – Section 2 Undergrad Paper
Specs
• Paper: 6-8 pages, 1.5 line spacing, 12 pt font
• PDF due to [email protected] by 8:45 a.m. on April 3.
• Focus of paper should be an area of science or society that has been transformed by the availability of digital data
• General outline:
– Description of the area and how data has transformed it
– Specifics on the kind of innovation in the application of data that has made this possible
– What kind of infrastructure is needed to make this possible
– Conclusion and thoughts about transformative potential of data in this area in the future
• Paper should include adequate references and bibliography (not included in the page count)
• Send a 1-2 page description of the topic of the paper and a detailed outline in .pdf to [email protected] by 12:00 a.m. on March 9.
UG Section 2 Paper Grading Metrics (20 points total)
Writing (10 points):
• Is the paper well-organized, readable by non-specialists, and credible to specialists?
• Is the writing articulate and compelling?
• Is the paper well-structured with the main points backed up by evidence?
Content (10 points):
• Does the paper content provide adequate depth and evidence to describe the transformation of an area by digital data?
• Are the references reasonable and adequate?
Grading Detail – Section 2 Grad mini-proposal
Specs
• Mini-Proposal: 10 pages, 1.5 line spacing, 12 pt font
• PDF due to [email protected] by 8:45 a.m. on April 3.
• Focus of mini-proposal is a research project in an area of science or society that has been transformed by the availability of digital data.
• Target solicitation: NSF Computational and Data-Enabled Science and Engineering (CDS&E) program (http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=504813&org=CISE&sel_org=CISE&from=fund )
• Mini-proposal should include:
– 1 page proposal summary providing Description, Intellectual Merit, and Broader Impacts sections
– 9 page proposal description including:
• Introduction (what is the proposal about?)
• Related work (relevant to the proposed project)
• Project plan (plan / approach for accomplishing the work)
• Metrics of success
• Conclusion
• Mini-proposal should include adequate references and bibliography (not included in the page count)
• Send a 1-2 page description of the topic of the paper and a detailed outline in .pdf to [email protected] by 12:00 a.m. on March 9.
Grad Mini-proposal Grading Metrics (20 points total)
Content (10 points):
• Does the project have a clear focus and a research plan?
• Do the metrics of success adequately support the goals and approach?
• Is the project a departure from related work and are the references reasonable and adequate?
Writing (10 points):
• Does the proposal follow the guidelines provided? Is it well-organized, readable by non-specialists, and credible to specialists?
• Is the writing articulate and compelling?
• Is the proposal well structured?
Heilmeier’s Catechism (from Wikipedia)
George Heilmeier was a former Director of DARPA (the Defense Advanced Research Projects Agency), former CTO of Texas Instruments, former President of Bellcore, and former CEO of SAIC.
Heilmeier's Catechism is a set of questions credited to Heilmeier that anyone proposing a research project or product development effort should be able to answer:
• What are you trying to do? Articulate your objectives using absolutely no jargon.
• How is it done today, and what are the limits of current practice?
• What's new in your approach and why do you think it will be successful?
• Who cares?
• If you're successful, what difference will it make?
• What are the risks and the payoffs?
• How much will it cost?
• How long will it take?
• What are the midterm and final "exams" to check for success? (metrics of success)
Data Round Table
Next week: L3 Data Roundtable for February 20
• The 4th Paradigm – Data Intensive Scientific Discovery. 2009. “A 2020 Vision for Ocean Science” article http://research.microsoft.com/en-us/collaboration/fourthparadigm/ (Philip Cioni)
• “Making Research and Data Cyberinfrastructure Real”, Educause Review, July/August 2008, http://www.cs.rpi.edu/~bermaf/Educause%20CI%20FINAL%2008.pdf (Pranshu Saxena)
• “Star is crowded by Super-Earths”, BBC (June, 2013), http://www.bbc.co.uk/news/science-environment-23032467 (Dennis Fogerty)
• “Scientists Use Stargazing Technology in the Fight against Cancer,” Time (February, 2013) http://healthland.time.com/2013/02/27/scientists-use-stargazing-technology-in-the-fight-against-cancer/ (Ted Tenedorio)
• “Digital Keys for Unlocking the Humanities’ Riches”, the New York Times, http://www.nytimes.com/2010/11/17/arts/17digital.html?pagewanted=all&_r=0 (Juan Poma)
Today: Lecture 2 Data Roundtable
• “Got Data? A Guide to Digital Preservation in the Information Age,” CACM (December, 2008) http://www.cs.rpi.edu/~bermaf/CACM08.pdf (Miguel Lantigua-Inoa)
• “A Digital Life,” Scientific American (March, 2007) http://www.scientificamerican.com/article/a-digital-life/ (Alex Karcher)
• “Thirteen Ways of Looking at … Digital Preservation,” D-Lib Magazine (August, 2004), http://www.dlib.org/dlib/july04/lavoie/07lavoie.html (Yusri Jamaluddin)
• “The Lost NASA Tapes: Restoring Lunar Images after 40 Years in the Vault”, ComputerWorld (June, 2009), http://www.computerworld.com/article/2525935/computer-hardware/the-lost-nasa-tapes--restoring-lunar-images-after-40-years-in-the-vault.html?page=2 (Robert Stephens)
• Princeton Single-Pay Storage model (2 students):
– “DataSpace: A Funding and Operational Model for LongTerm Preservation and Sharing of Research Data” White Paper (August, 2010), http://dataspace.princeton.edu/jspui/bitstream/88435/dsp01w6634361k/1/DataSpaceFundingModel_20100827.pdf (Lars Olsson)
– “Paying for Long Term Storage”, DSHR’s Blog (February, 2011), http://blog.dshr.org/2011/02/paying-for-long-term-storage.html (Oskari Rautiainen)