unlocking open data in the cloud
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sharing Planetary-Scale Open Data on AWS
Jed Sundwall, Global Open Data Lead
AWS Pop-up Loft – London, 21 April 2016
Why does AWS care about Open Data?
§ Many of our commercial sector customers rely on quality open data as much as they rely on our cloud infrastructure services.
§ Many of our public sector customers use AWS to make their data available to a global community of researchers, entrepreneurs, students, and fellow government agencies.
Sharing data on AWS makes it accessible to a large and growing community of researchers, entrepreneurs, and enterprises who use the AWS cloud.
Big Data: Collaboration in the Cloud
The Big Data Challenge
Traditionally, it has been time consuming and expensive to acquire, store, and analyze large data sets.
Our Solution – Shared Open Data on AWS
When data is shared on AWS, users can work with any amount of data without needing to download it or store their own copies.
When heavy data like Earth imagery is available in AWS's cloud, the time required to copy the data to a virtual server for analysis is virtually eliminated.
Traditional Data Acquisition
“…data must be organized, well-documented, consistently formatted, and error free. Cleaning the data is often the most taxing part of data science, and is frequently 80% of the work.”
- Data Driven by DJ Patil and Hilary Mason
We ask: How can we get rid of that 80%?
Bandwidth and infrastructure constraints slow down research and development. New and better sensors are exacerbating this problem.
Diagram: Tape, Data Center, Disk, Server, Client
The cloud allows researchers from anywhere to take their algorithms to data rather than downloading data to their computing resources.
Data Acquisition in the Cloud
When data is shared in the cloud, anyone can analyze it without needing to download it or store it themselves.
“Ordinarily, hitting ‘copy’ on a 4 gigabyte file is an opportunity to stand up and get a fresh cup of coffee, browse the sports section for a little while, but moving data between servers in an Amazon data center barely affords time to touch your toes a couple times.” — Paul Ramsey
Source: http://s3.cleverelephant.ca.s3.amazonaws.com/2015-ccog.pdf
Open data as a platform
Diagram: Data Creation, Data Enrichment, Sensemaking
Layers shown: Data at Rest (object storage), Basic APIs, Complex APIs, Consumer applications, Algorithmic policy, Data-driven journalism, Data catalogs, Focused data dashboards, Predictive modeling, Visualizations
Lower cost of knowledge (efficiency)
Open data as a platform
Diagram: Data Enrichment and Sensemaking, with AWS services: Amazon Kinesis, Amazon EC2, AWS Data Pipeline, Amazon S3, Amazon RDS, Amazon EMR, Amazon Redshift, Amazon DynamoDB, AWS Lambda
Public Data Sets on AWS
Several high-value datasets are available for anyone to access for free on AWS. Examples include:
§ Landsat on AWS
§ 3K Rice Genome
§ NEXRAD on AWS
Landsat on AWS
We have committed to make up to 1 petabyte of Landsat imagery readily available as objects on Amazon S3.
All Landsat 8 scenes from 2015 and 2016 are available, along with a selection of cloud-free scenes from 2013 and 2014.
All new Landsat 8 scenes are made available each day (~700 per day), often within hours of production.
RGB: Visible light
Infrared: Vegetation
Shortwave infrared: Urban areas
Think of URLs instead of copies
Wellington, New Zealand
https://landsat-pds.s3.amazonaws.com/L8/072/089/
Landsat – Not Just Pretty Pictures
Landsat scenes are made up of multiple files, each of which includes data about different kinds of light reflected off of Earth.
The Landsat program has been in operation since 1972.
Each pixel of each Landsat 8 file represents a 12-bit measurement of light reflected off a 30 m × 30 m (900 m²) part of our planet.
Each Landsat 8 scene contains about 840 million pixels and takes up about 800 MB.
We’re currently hosting over 128,000 Landsat 8 scenes and make about 700 new scenes available on S3 every day.
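As a rough back-of-the-envelope check, the figures on this slide can be multiplied out. This is only a sketch: it assumes every scene is about 800 MB and uses decimal units (1 GB = 1,000 MB), so the results are approximations.

```python
# Figures quoted on the slide; treated here as round approximations.
SCENE_SIZE_MB = 800        # approximate size of one Landsat 8 scene
NEW_SCENES_PER_DAY = 700   # approximate number of new scenes per day
HOSTED_SCENES = 128_000    # scenes hosted at the time of the talk

# Decimal units are an assumption: 1 GB = 1,000 MB, 1 TB = 1,000,000 MB.
daily_gb = NEW_SCENES_PER_DAY * SCENE_SIZE_MB / 1_000
hosted_tb = HOSTED_SCENES * SCENE_SIZE_MB / 1_000_000

print(f"New data per day: about {daily_gb:.0f} GB")
print(f"Hosted archive:   about {hosted_tb:.1f} TB")
```

Under these assumptions the archive grows by roughly half a terabyte per day, which is the kind of volume that is impractical to copy before analysis.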
Landsat on AWS
Landsat on AWS makes each band of each scene readily available as objects on Amazon S3. Data can be accessed programmatically via HTTP and quickly deployed to any of our products for analysis and processing.
Users do not need to worry about local storage and have access to virtually unlimited computing power on demand.
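A minimal sketch of what "accessed programmatically via HTTP" can look like: building the URL of one band of one scene. The key layout (L8/path/row/scene/scene_Bn.TIF) is taken from the example URL and file name shown elsewhere in this deck; the function name `band_url` is hypothetical, and the bucket may not be served at this address any more.

```python
# Base URL as shown on the "Think of URLs instead of copies" slide.
BASE_URL = "https://landsat-pds.s3.amazonaws.com"


def band_url(path: int, row: int, scene_id: str, band: int) -> str:
    """Build the HTTPS URL for one band file of one Landsat 8 scene.

    Assumes the key layout L8/<path>/<row>/<scene>/<scene>_B<band>.TIF,
    with path and row zero-padded to three digits.
    """
    return (
        f"{BASE_URL}/L8/{path:03d}/{row:03d}/"
        f"{scene_id}/{scene_id}_B{band}.TIF"
    )


# Scene ID copied from the file-name example in this deck.
print(band_url(139, 45, "LC81390452014295LGN00", 2))
```

Because each band is its own object, a tool that only needs one or two bands can request just those files instead of a whole scene archive.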
Diagram: USGS (.tar), s3://landsat-pds (.tiff), Amazon EC2
Landsat on AWS: Usage
In the first year (19 Mar 2015 – 19 Mar 2016)
§ Over 400,000 scenes available
§ Over 1 billion hits globally
Image shows frequency of scene requests by path/row.
White: ~100 requests
Orange: >300k requests
Visualization by Drew Bollinger, Development Seed
MATLAB – Landsat8 Data Explorer
MathWorks created a freely downloadable MATLAB-based tool for accessing, processing, and visualizing Landsat 8 data.
The tool allows MATLAB users to find Landsat 8 scenes, analyze them, and combine them with other sources of GIS data for new visualizations.
http://blogs.mathworks.com/steve/2015/03/19/matlab-landsat-8-aws/
Esri – Unlock Earth’s Secrets
Esri has created a tool to show how ArcGIS Online can serve Landsat data for live visualization and analysis within the browser.
“These are not pre-generated cache services limited to just visualization—they are dynamic, high-performance image services that perform on-the-fly processing and dynamic mosaicking of Landsat’s multi-spectral and multi-temporal imagery.”
http://www.esri.com/landsatonaws
landsat-util
Landsat on AWS helped Development Seed make optimizations that make landsat-util over 2× faster and allow for more functionality.
https://developmentseed.org/blog/2015/03/19/aws-landsat-archive/
Landsat-live
Mapbox created Landsat-live, a map that is constantly refreshed with the latest satellite imagery from NASA’s Landsat 8 satellite.
Creating a live Earth imagery pipeline is possible because Landsat imagery is available on Amazon S3 within hours of creation.
https://www.mapbox.com/blog/landsat-live-live/
Snapsat
A team of five novice programmers used Landsat on AWS to build a web app called Snapsat that creates Landsat data visualizations in seconds.
Snapsat was built during the team’s 8-week training program at Code Fellows. They launched it just months after learning to write code.
http://snapsat.org
Landsat on AWS as a platform
The NOAA Big Data Project
Amazon Web Services has entered into a research agreement with the US National Oceanic and Atmospheric Administration (NOAA) to explore sustainable models to increase the output of open NOAA data.
Under this new agreement, AWS and its collaborators are exploring ways to make NOAA data easier to access and use.
NEXRAD on AWS
The Next Generation Weather Radar (NEXRAD) is a network of 160 high-resolution Doppler radar sites that detects precipitation and atmospheric movement and disseminates data at 5-minute intervals from each site.
It has traditionally been time consuming and expensive to acquire, store, and analyze NEXRAD data. Accessing the full historical archive has been impossible.
NEXRAD on AWS – RESTful Interface
NEXRAD on AWS makes 270 TB of individual volume scan files and real-time chunks freely available on Amazon S3.
§ Data can be accessed programmatically via a RESTful interface and quickly deployed to any of our products for analysis and processing.
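One way to picture the RESTful interface is the S3 object-listing request, which is a plain HTTPS GET with query parameters. The sketch below only builds such a URL. The bucket name and the date-based prefix are hypothetical placeholders, not the actual NEXRAD bucket or key layout, and `list_url` is an illustrative helper rather than part of any AWS SDK.

```python
from urllib.parse import urlencode


def list_url(bucket: str, prefix: str, delimiter: str = "/") -> str:
    """Build an S3 ListObjectsV2 URL for keys under a given prefix.

    `list-type=2`, `prefix`, and `delimiter` are standard S3 REST
    query parameters; the bucket and prefix are supplied by the caller.
    """
    query = urlencode(
        {"list-type": "2", "prefix": prefix, "delimiter": delimiter}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


# Hypothetical bucket and prefix, for illustration only.
print(list_url("example-bucket", "2016/04/21/"))
```

The response to such a request is an XML document listing matching keys, which any HTTP client can fetch and parse without special tooling.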
NEXRAD on AWS – Event-driven Analysis
§ Amazon Simple Notification Service (SNS) allows subscription to notifications of new data.
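A minimal sketch of the consuming side of such a subscription, assuming the topic relays standard S3 event notifications wrapped in the usual SNS envelope (the actual message format of the NEXRAD topic is not shown in this deck). The function name, bucket name, and object key below are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def new_object_keys(sns_envelope: str) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an SNS-delivered S3 event.

    The SNS envelope carries the S3 event as a JSON string in its
    "Message" field; object keys in S3 events are URL-encoded.
    """
    envelope = json.loads(sns_envelope)
    event = json.loads(envelope["Message"])
    return [
        (
            record["s3"]["bucket"]["name"],
            unquote_plus(record["s3"]["object"]["key"]),
        )
        for record in event.get("Records", [])
    ]


# Hypothetical notification body, shaped like an S3 event inside SNS.
sample = json.dumps({
    "Type": "Notification",
    "Message": json.dumps({
        "Records": [{
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "2016/04/21/SITE/example-file"},
            }
        }]
    }),
})

print(new_object_keys(sample))
```

A handler like this could run wherever the subscription delivers messages, so analysis starts when new data arrives instead of on a polling schedule.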
NEXRAD on AWS: Early Use Cases
§ The Climate Corporation cut two weeks out of an analysis pipeline and is incorporating the real-time feed into its production workflows.
§ A weather data company stopped storing its own NEXRAD archive, freeing up revenue to build new products.
§ Several weather companies are developing new products based on the archival data and real-time feed.
NEXRAD on AWS has already reduced the cost of research and product development.
§ Unidata has made the archive data available via an AWS-hosted THREDDS Data Server to users with .edu domains.
§ Several researchers are planning longitudinal analyses using the full archive.
§ An open sensor data standards group is evaluating the NEXRAD on AWS model.
What have we learned?
Opening Data is Not Enough
File Names Matter
never_gonna_give_you_up.mp3
Sorry!
Not really. :)
File Names Matter
L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B2.TIF
Sensor: L8 (LC8 in the scene ID)
Path: 139
Row: 045
Year: 2014
Day of year: 295
Band: B2
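Because the key encodes its own metadata, it can be parsed without opening the file. The following is a sketch that assumes the key layout shown on this slide and only handles numbered bands; the pattern, function name, and the treatment of the trailing five characters of the scene ID (which the slide does not label) are illustrative assumptions.

```python
import re
from datetime import date, timedelta

# Pattern for keys shaped like the example on the slide. The scene ID
# repeats the path and row, then carries year and day of year. The last
# five characters (e.g. "LGN00") are matched but not interpreted here.
KEY_PATTERN = re.compile(
    r"^L8/(?P<path>\d{3})/(?P<row>\d{3})/"
    r"(?P<scene>(?P<sensor>LC8)(?P=path)(?P=row)"
    r"(?P<year>\d{4})(?P<doy>\d{3})[A-Z]{3}\d{2})/"
    r"(?P=scene)_B(?P<band>\d{1,2})\.TIF$"
)


def parse_key(key: str) -> dict:
    """Split a Landsat 8 object key into the fields named on the slide."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unrecognized key layout: {key}")
    year = int(match["year"])
    day_of_year = int(match["doy"])
    return {
        "sensor": match["sensor"],
        "path": int(match["path"]),
        "row": int(match["row"]),
        "year": year,
        "day_of_year": day_of_year,
        # Day 1 is 1 January, so add (day_of_year - 1) days.
        "acquired": date(year, 1, 1) + timedelta(days=day_of_year - 1),
        "band": int(match["band"]),
    }


info = parse_key(
    "L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B2.TIF"
)
print(info)
```

Predictable names like this let users find data by location and date with simple prefix queries instead of a separate catalog lookup.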
HTTP Range GETs and Headers are Cool
https://sgillies.net/2016/04/05/rasterio-0-34.html
This call fetches only 16,384 bytes of a 50MB TIFF to get the metadata printed above. That’s an efficiency of 3000:1.
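To illustrate the underlying mechanism, here is a sketch of a plain HTTP range request using only the Python standard library. It is not how rasterio does it internally; it simply shows the `Range` header that makes a partial read possible. The helper names are hypothetical, and the commented-out URL is composed from the bucket and key shown in this deck and may no longer resolve.

```python
import urllib.request


def build_range_request(url: str, start: int, length: int) -> urllib.request.Request:
    """Create a GET request for `length` bytes beginning at `start`."""
    end = start + length - 1  # the end offset in a Range header is inclusive
    return urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{end}"}
    )


def fetch_range(url: str, start: int, length: int) -> bytes:
    """Download only the requested byte range (requires network access)."""
    with urllib.request.urlopen(build_range_request(url, start, length)) as resp:
        return resp.read()


def savings_ratio(total_bytes: int, fetched_bytes: int) -> float:
    """How many times smaller the partial read is than the whole file."""
    return total_bytes / fetched_bytes


# The slide's numbers: 16,384 bytes read out of a roughly 50 MB file.
print(round(savings_ratio(50_000_000, 16_384)))

# Example call, left commented out because it needs network access:
# header_bytes = fetch_range(
#     "https://landsat-pds.s3.amazonaws.com/L8/139/045/"
#     "LC81390452014295LGN00/LC81390452014295LGN00_B2.TIF",
#     0, 16_384,
# )
```

A server that supports range requests answers with status 206 and only the requested bytes, which is what lets tools read a file header without downloading the imagery.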
Requester Pays
With Requester Pays buckets, the requester instead of the bucket owner pays the cost of the request and the data download from the bucket.
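From the requester's side, this mostly means acknowledging the charge on each request. The sketch below assumes the third-party boto3 library is installed and configured with AWS credentials (Requester Pays requests must be authenticated); the helper names, bucket, and key are hypothetical.

```python
def requester_pays_kwargs(bucket: str, key: str) -> dict:
    """Arguments for an S3 GetObject call against a Requester Pays bucket.

    `RequestPayer="requester"` tells S3 that the caller accepts the
    request and data transfer charges.
    """
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(bucket: str, key: str, destination: str) -> None:
    """Fetch one object and write it to a local file."""
    import boto3  # third-party; assumed to be installed and configured

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_kwargs(bucket, key))
    with open(destination, "wb") as handle:
        handle.write(response["Body"].read())


# Hypothetical bucket and key, for illustration only.
print(requester_pays_kwargs("example-bucket", "path/to/object"))
```

A request that omits this acknowledgement is rejected, so users cannot be billed by accident.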
Driving innovation with open data
To drive innovation with your data, make sure:
1. It’s accurate
2. It’s documented
3. It will still be available to developers tomorrow
These are the hard parts. Storage and delivery shouldn’t be.
Up next:
Nicolas Terpolilli, OpenDataSoft