Unlocking Open Data in the Cloud

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sharing Planetary-Scale Open Data on AWS Jed Sundwall, Global Open Data Lead AWS Pop-up Loft – London, 21 April 2016

Upload: amazon-web-services

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Unlocking Open Data in the Cloud

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Sharing Planetary-Scale Open Data on AWS

Jed Sundwall, Global Open Data Lead
AWS Pop-up Loft – London, 21 April 2016

Page 2: Unlocking Open Data in the Cloud

Why does AWS care about Open Data?

§ Many of our commercial sector customers rely on quality open data as much as they rely on our cloud infrastructure services.

§ Many of our public sector customers use AWS to make their data available to a global community of researchers, entrepreneurs, students, and fellow government agencies.


Sharing data on AWS makes it accessible to a large and growing community of researchers, entrepreneurs, and enterprises who use the AWS cloud.

Page 3: Unlocking Open Data in the Cloud

Big Data: Collaboration in the Cloud

The Big Data Challenge

Traditionally, it has been time-consuming and expensive to acquire, store, and analyze large data sets.

Our Solution – Shared Open Data on AWS

When data is shared on AWS, users can work with any amount of data without needing to download it or store their own copies.

When heavy data like Earth imagery is available in AWS's cloud, the time required to copy the data to a virtual server for analysis is virtually eliminated.

Page 4: Unlocking Open Data in the Cloud

Traditional Data Acquisition

“…data must be organized, well-documented, consistently formatted, and error free. Cleaning the data is often the most taxing part of data science, and is frequently 80% of the work.”

- Data Driven by DJ Patil and Hilary Mason

We ask: How can we get rid of that 80%?

Bandwidth and infrastructure constraints slow down research and development. New and better sensors are exacerbating this problem.

[Diagram labels: Tape, Data Center, Disk, Server, Client]

Page 5: Unlocking Open Data in the Cloud

The cloud allows researchers from anywhere to take their algorithms to data rather than downloading data to their computing resources.

Data Acquisition in the Cloud


When data is shared in the cloud, anyone can analyze it without needing to download it or store it themselves.

“Ordinarily, hitting ‘copy’ on a 4 gigabyte file is an opportunity to stand up and get a fresh cup of coffee, browse the sports section for a little while, but moving data between servers in an Amazon data center barely affords time to touch your toes a couple times.” — Paul Ramsey


Source: http://s3.cleverelephant.ca.s3.amazonaws.com/2015-ccog.pdf

Page 6: Unlocking Open Data in the Cloud

Open data as a platform

[Diagram: stages Data Creation, Data Enrichment, and Sensemaking. Labels: Data at Rest (object storage), Basic APIs, Complex APIs, Data Catalogs, Consumer applications, Algorithmic policy, Data-driven journalism, Focused data dashboards, Predictive modeling, Visualizations, Lower cost of knowledge (efficiency).]

Page 7: Unlocking Open Data in the Cloud

Open data as a platform

[Diagram: AWS services placed along the Data Enrichment and Sensemaking stages: Amazon Kinesis, Amazon EC2, AWS Data Pipeline, Amazon S3, Amazon RDS, Amazon EMR, Amazon Redshift, Amazon DynamoDB, AWS Lambda.]

Page 8: Unlocking Open Data in the Cloud

Public Data Sets on AWS

Several high-value datasets are available for anyone to access for free on AWS. Examples include:

§ Landsat on AWS

§ 3K Rice Genome

§ NEXRAD on AWS

Page 9: Unlocking Open Data in the Cloud

Landsat on AWS

We have committed to make up to 1 petabyte of Landsat imagery readily available as objects on Amazon S3.

All Landsat 8 scenes from 2015 and 2016 are available, along with a selection of cloud-free scenes from 2013 and 2014.

All new Landsat 8 scenes are made available each day (~700 per day), often within hours of production.


Page 10: Unlocking Open Data in the Cloud

Think of URLs instead of copies

Wellington, New Zealand
https://landsat-pds.s3.amazonaws.com/L8/072/089/

RGB: Visible light

Infrared: Vegetation

Shortwave infrared: Urban areas
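A minimal sketch of the "URLs instead of copies" idea, assuming the L8/path/row layout visible in the Wellington URL on this slide. The function name is hypothetical, and the three-digit zero padding is inferred from that single example.

```python
def scene_prefix_url(path, row, bucket="landsat-pds"):
    """Build the HTTPS prefix under which scenes for one path/row are stored.

    The layout (L8/<path>/<row>/, zero-padded to three digits) is inferred
    from the example URL on the slide; it is an assumption, not a spec.
    """
    return f"https://{bucket}.s3.amazonaws.com/L8/{path:03d}/{row:03d}/"


# Path 72, row 89 is the Wellington, New Zealand example from the slide.
wellington = scene_prefix_url(72, 89)
print(wellington)
```

Because the address is derived from the path and row alone, two users who want the same place can share a short string rather than a copy of the imagery.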

Page 11: Unlocking Open Data in the Cloud

Landsat – Not Just Pretty Pictures

Landsat scenes are made up of multiple files, each of which includes data about different kinds of light reflected off of Earth.

The Landsat program has been in operation since 1972.

Page 12: Unlocking Open Data in the Cloud

Landsat – Not Just Pretty Pictures

Each pixel of each Landsat 8 file represents a 12-bit measurement of light reflected off a roughly 30 m × 30 m patch of our planet.

Each Landsat 8 scene contains about 840 million pixels and takes up about 800 MB.

We currently host over 128,000 Landsat 8 scenes and make about 700 new scenes available on S3 every day.
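As a rough back-of-the-envelope check, the figures on this slide can be combined to estimate the size of the hosted archive and its daily growth. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and treats every scene as exactly 800 MB, which the slide only gives as an approximation.

```python
# Figures quoted on the slide (all approximate).
SCENES_HOSTED = 128_000
NEW_SCENES_PER_DAY = 700
SCENE_SIZE_MB = 800
PIXELS_PER_SCENE = 840_000_000

# Assumption: decimal units, i.e. 1 GB = 1,000 MB and 1 TB = 1,000,000 MB.
archive_tb = SCENES_HOSTED * SCENE_SIZE_MB / 1_000_000
daily_growth_gb = NEW_SCENES_PER_DAY * SCENE_SIZE_MB / 1_000
bytes_per_pixel = SCENE_SIZE_MB * 1_000_000 / PIXELS_PER_SCENE

print(f"archive: about {archive_tb:.0f} TB")
print(f"growth:  about {daily_growth_gb:.0f} GB per day")
print(f"storage: about {bytes_per_pixel:.2f} bytes per pixel")
```

Under these assumptions the archive is on the order of 100 TB and grows by roughly half a terabyte a day, which is the scale at which copying the data stops being practical for most users.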

Page 13: Unlocking Open Data in the Cloud

Landsat on AWS

Landsat on AWS makes each band of each scene readily available as objects on Amazon S3. Data can be accessed programmatically via HTTP and quickly deployed to any of our products for analysis and processing.

Users do not need to worry about local storage and have access to virtually unlimited computing power on demand.

[Diagram labels: USGS (.tar), s3://landsat-pds (.tiff), Amazon EC2]

Page 14: Unlocking Open Data in the Cloud

Landsat on AWS: Usage

In the first year (19 Mar 2015 – 19 Mar 2016)

§ Over 400,000 scenes available

§ Over 1 billion hits globally

Image shows frequency of scene requests by path/row.

White: ~100 requests
Orange: >300k requests

Visualization by Drew Bollinger, Development Seed

Page 15: Unlocking Open Data in the Cloud

MATLAB – Landsat8 Data Explorer

MathWorks created a freely downloadable MATLAB-based tool for accessing, processing, and visualizing Landsat 8 data.

The tool allows MATLAB users to find Landsat 8 scenes, analyze them, and combine them with other sources of GIS data for new visualizations.

http://blogs.mathworks.com/steve/2015/03/19/matlab-landsat-8-aws/


Page 16: Unlocking Open Data in the Cloud

Esri – Unlock Earth’s Secrets

Esri has created a tool that shows how ArcGIS Online can render Landsat data for live visualization and analysis within the browser.

“These are not pre-generated cache services limited to just visualization—they are dynamic, high-performance image services that perform on-the-fly processing and dynamic mosaicking of Landsat’s multi-spectral and multi-temporal imagery.”

http://www.esri.com/landsatonaws

Page 17: Unlocking Open Data in the Cloud

https://developmentseed.org/blog/2015/03/19/aws-landsat-archive/

landsat-util

Landsat on AWS helped Development Seed make optimizations that make landsat-util over 2× faster and allow for more functionality.


Page 18: Unlocking Open Data in the Cloud

Landsat-live

Mapbox created Landsat-live, a map that is constantly refreshed with the latest satellite imagery from NASA’s Landsat 8 satellite.

Creating a live Earth imagery pipeline is possible because Landsat imagery is available on Amazon S3 within hours of creation.

https://www.mapbox.com/blog/landsat-live-live/

Page 19: Unlocking Open Data in the Cloud

Snapsat

A team of five novice programmers used Landsat on AWS to build a web app called Snapsat that creates Landsat data visualizations in seconds.

Snapsat was built during the team’s 8-week training program at Code Fellows. They launched it just months after learning to write code.

http://snapsat.org


Page 20: Unlocking Open Data in the Cloud

Landsat on AWS as a platform


Page 21: Unlocking Open Data in the Cloud

The NOAA Big Data Project


Amazon Web Services has entered into a research agreement with the US National Oceanic and Atmospheric Administration (NOAA) to explore sustainable models to increase the output of open NOAA data.

Under this new agreement, AWS and its collaborators are exploring ways to make NOAA data easier to access and use.

Page 22: Unlocking Open Data in the Cloud

NEXRAD on AWS

The Next Generation Weather Radar (NEXRAD) is a network of 160 high-resolution Doppler radar sites that detects precipitation and atmospheric movement and disseminates data in 5-minute intervals from each site.

It has traditionally been time-consuming and expensive to acquire, store, and analyze NEXRAD data. Accessing the full historical archive has been impossible.

Page 23: Unlocking Open Data in the Cloud

NEXRAD on AWS – RESTful Interface

NEXRAD on AWS makes 270 TB of individual volume scan files and real-time chunks freely available on Amazon S3.

§ Data can be accessed programmatically via a RESTful interface and quickly deployed to any of our products for analysis and processing.
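A hedged sketch of what programmatic access through S3's REST interface might look like: it builds a listing URL for one radar site on one day. The bucket name `noaa-nexrad-level2`, the `YYYY/MM/DD/SITE/` key layout, and the site code `KTLX` are assumptions for illustration and do not appear in the slides. The `list-type=2` and `prefix` query parameters are part of the standard S3 object-listing API.

```python
from datetime import date
from urllib.parse import urlencode


def nexrad_listing_url(day, site, bucket="noaa-nexrad-level2"):
    """Build an S3 REST listing URL for one site and one day.

    Assumptions (not from the slides): the bucket name and the
    YYYY/MM/DD/SITE/ key layout.
    """
    prefix = f"{day:%Y/%m/%d}/{site}/"
    query = urlencode({"list-type": "2", "prefix": prefix})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


# Hypothetical example: one site on the day of this talk.
listing_url = nexrad_listing_url(date(2016, 4, 21), "KTLX")
print(listing_url)
```

A GET on a URL of this shape returns an XML listing of matching objects, so a client can discover files for a given time and place without maintaining its own index.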


Page 24: Unlocking Open Data in the Cloud

NEXRAD on AWS – Event-driven Analysis

NEXRAD on AWS makes 270 TB of individual volume scan files and real-time chunks freely available on Amazon S3.

§ Amazon Simple Notification Service (SNS) allows subscription to notifications of new data.
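One way to picture event-driven analysis is a small handler that receives an SNS notification and pulls out the keys of newly arrived objects. This sketch assumes the notification carries a standard S3 event message wrapped by SNS; the slides do not specify the payload format, and the sample bucket and key below are made up.

```python
import json
from urllib.parse import unquote_plus


def new_object_keys(sns_event):
    """Return (bucket, key) pairs announced in an SNS-delivered S3 event."""
    found = []
    for record in sns_event.get("Records", []):
        # SNS delivers the original notification as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        for s3_record in message.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys in S3 event notifications are URL-encoded.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            found.append((bucket, key))
    return found


# Hypothetical payload for illustration only.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "Records": [{
                    "s3": {
                        "bucket": {"name": "example-radar-bucket"},
                        "object": {"key": "2016/04/21/SITE/volume_scan_001.gz"},
                    }
                }]
            })
        }
    }]
}

print(new_object_keys(sample_event))
```

A function like this could sit at the front of a processing job so that analysis starts when data arrives instead of on a polling schedule.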


Page 25: Unlocking Open Data in the Cloud

NEXRAD on AWS: Early Use Cases

§ Climate Corporation cut two weeks out of an analysis pipeline and is incorporating the real-time feed into its production workflows.

§ A weather data company stopped storing its own NEXRAD archive, freeing up revenue to build new products.

§ Several weather companies are developing new products based on the archival data and real-time feed.

NEXRAD on AWS has already reduced the cost of research and product development.

§ Unidata has made the archive data available via an AWS-hosted THREDDS Data Server to users with .edu domains.

§ Several researchers are planning longitudinal analyses using the full archive.

§ An open sensor data standards group is evaluating the NEXRAD on AWS model.

Page 26: Unlocking Open Data in the Cloud

What have we learned?


Page 27: Unlocking Open Data in the Cloud

Opening Data is Not Enough

Page 28: Unlocking Open Data in the Cloud

File Names Matter

never_gonna_give_you_up.mp3

Page 29: Unlocking Open Data in the Cloud


Sorry!

Page 30: Unlocking Open Data in the Cloud


Not really. :)

Page 31: Unlocking Open Data in the Cloud

File Names Matter

L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B2.TIF

Page 32: Unlocking Open Data in the Cloud

File Names Matter

L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B2.TIF

Sensor: LC8 (Landsat 8)

Path: 139

Row: 045

Year: 2014

Day of year: 295

Band: B2
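Because the key encodes sensor, path, row, year, day of year, and band, a few lines of code can recover that metadata without opening the file. The sketch below assumes the fixed-width scene ID layout seen in the example key on this slide; the function name and the returned field names are hypothetical.

```python
from datetime import date, timedelta


def parse_landsat_key(key):
    """Split a Landsat on AWS object key into the fields the name encodes.

    Assumes the fixed-width scene ID seen in the slide's example:
    3 chars sensor, 3 path, 3 row, 4 year, 3 day of year, then a suffix.
    """
    filename = key.rsplit("/", 1)[-1]
    scene_id, _, rest = filename.rpartition("_")
    year = int(scene_id[9:13])
    day_of_year = int(scene_id[13:16])
    return {
        "sensor": scene_id[0:3],
        "path": int(scene_id[3:6]),
        "row": int(scene_id[6:9]),
        "year": year,
        "day_of_year": day_of_year,
        # Day 1 is 1 January, so add (day_of_year - 1) days.
        "acquired": date(year, 1, 1) + timedelta(days=day_of_year - 1),
        "band": rest.split(".")[0],
    }


info = parse_landsat_key(
    "L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B2.TIF"
)
print(info)
```

For the example key, day 295 of 2014 corresponds to 22 October 2014, so a listing of keys alone is enough to filter scenes by place, date, and band.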

Page 33: Unlocking Open Data in the Cloud

HTTP Range Gets and Headers are Cool

https://sgillies.net/2016/04/05/rasterio-0-34.html

This call fetches only 16,384 bytes of a 50MB TIFF to get the metadata printed above. That’s an efficiency of 3000:1.
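The linked post describes this behaviour in rasterio. As a rough illustration of the underlying mechanism, the sketch below uses only Python's standard library to build an HTTP request with a `Range` header for the first 16,384 bytes of a file. The URL is assembled from the bucket host and object key shown elsewhere in this deck, and the sketch does not check whether that object is still reachable. The 50 MB figure is taken from the slide and treated as decimal megabytes.

```python
import urllib.request


def range_request(url, first_byte=0, length=16_384):
    """Build a GET request for `length` bytes starting at `first_byte`."""
    last_byte = first_byte + length - 1  # Range bounds are inclusive
    headers = {"Range": f"bytes={first_byte}-{last_byte}"}
    return urllib.request.Request(url, headers=headers)


def fetch_header_bytes(url, length=16_384):
    """Download only the leading bytes of a remote file (needs network)."""
    with urllib.request.urlopen(range_request(url, 0, length)) as response:
        return response.read()


# URL assembled from the bucket host and key shown in the slides.
url = (
    "https://landsat-pds.s3.amazonaws.com/"
    "L8/139/045/LC81390452014295LGN00/LC81390452014295LGN00_B2.TIF"
)
request = range_request(url)
print(request.get_header("Range"))

# Ratio of full file size to bytes actually fetched, assuming 50 MB = 50,000,000 bytes.
ratio = 50 * 1_000_000 / 16_384
print(f"about {ratio:.0f}:1")
```

Only `fetch_header_bytes` touches the network; building the request is enough to see that the client asks the server for a byte range rather than the whole object.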

Page 34: Unlocking Open Data in the Cloud

Requester Pays

With Requester Pays buckets, the requester, rather than the bucket owner, pays for the request and for downloading data from the bucket.
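In practice, a client has to state explicitly that it accepts the charges. A minimal sketch with the third-party boto3 library, assuming it is installed and AWS credentials are configured; the bucket and key names are hypothetical placeholders.

```python
def requester_pays_kwargs(bucket, key):
    """Arguments for an S3 GetObject call that accepts requester charges."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def download(bucket, key):
    """Fetch an object from a Requester Pays bucket (needs boto3 and credentials)."""
    import boto3  # third-party; imported here so the sketch loads without it

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_kwargs(bucket, key))
    return response["Body"].read()


# Hypothetical names for illustration only.
kwargs = requester_pays_kwargs("example-requester-pays-bucket", "data/file.tif")
print(kwargs)
```

Requests to such a bucket must be authenticated, since the charges need an account to be billed to.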

Page 35: Unlocking Open Data in the Cloud

Driving innovation with open data

To drive innovation with your data, make sure:

1. It’s accurate

2. It’s documented

3. It will still be available to developers tomorrow

These are the hard parts. Storage and delivery shouldn’t be.

Page 36: Unlocking Open Data in the Cloud

Thank you!

[email protected]

Jed Sundwall, Global Open Data Lead

Page 37: Unlocking Open Data in the Cloud

Up next:

Nicolas Terpolilli, OpenDataSoft