
Post on 16-Jan-2016


Understanding and Comparing Remote Sensing Data to Model Output

Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation

Roadmap

• Motivation
• Background
• Earth System Grid, NASA
• Inserting observations into AR5
• Why is this so difficult?
• Data management issues
• Architectural issues
• Approaches for dealing with observations and models
• Approaches for comparing observations to models
• Architectural patterns
• Example: AIRS Level 2 data to NCAR CCSM model output
• Tool support
• Wrap-up

25-Mar-11 CORDEX-MATTMANN

And you are?

• Apache Member involved in
– OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor), Gora (Champion), and MRUnit (Mentor)

• Senior Computer Scientist at NASA JPL in Pasadena, CA USA

• Software Architecture/Engineering Prof at Univ. of Southern California


Motivation


How to bring as much observational scrutiny as possible to the IPCC process?

How to best utilize the wealth of NASA Earth science information for the IPCC process?

Credit: Waliser, Teixeira, Crichton, Ferraro

Inserting Observations in the IPCC

• Observations play a critical role in climate research
– Process understanding
  • Exploratory data analysis
  • Hypothesis formulation
– Parameterization and model development
  • Statistical description of sub-grid-scale processes
  • Hypothesis testing
– Model evaluation (IPCC)
  • Comparison of model output against observations
  • Weighting multi-model ensemble members (“scoring”)
• NASA is at a critical juncture in inserting observations into AR5
– Climate research community recognizes the importance of comparing models to data
– The infrastructures, different formats, etc. make this a challenging problem
– Time, however, is limited

Credit: Amy Braverman

DOE Earth System Grid

• Purpose
– Provide climate researchers worldwide with access to data, information, models, analysis tools, and computational resources required to make sense of enormous climate simulation datasets
• Scope
– Petabyte-scale data volumes
– Gateway to climate change data products, model outputs and informational sites (i.e., globally federated sites)
– Comprehensive registry of climate change Earth Science research results and components
– Support climate change and its partner scientists, analysts, data managers, educators and decision makers
– Resource to national and international science and societal benefit initiatives
– Resource to climate change data products through interoperable web service and climate analysis tools

Credit: Dean Williams

ESG Principal Sites

Credit: Dean Williams

ESG Conceptual Overview

[Figure: ESG conceptual overview diagram. Surviving label: Standard Browser, Web Services]

Credit: Dean Williams

The Next-generation ESG

• Independent gateways federating metadata, users.
• Individual data nodes responsible for publishing services.
• Designed for model output data sets.

ESG Gateways and Nodes

• Federated architecture
– Federation is a virtual trust relationship among independent management domains that have their own set of services. Users authenticate once to gain access to data across multiple systems and organizations
• Gateways
– Where data is discovered, requested
– Portals, search capability, distributed metadata, registration and user management
– May be customized to an institution’s requirements, topical focus
– More complex architecture than nodes, fewer sites
– Initially PCMDI, NCAR, ORNL, eventually GFDL
• Nodes
– Where data is stored and published
– Data may be on disk or tertiary mass store
– Each data node can publish to any gateway (facilitates topical gateways)
– Data reduction/analysis
– Less complex architecture, including possible minimalist deployment w/o services
– Anticipate ~20 data nodes for CMIP5, many others have expressed interest
• Sites
– A site can be both a gateway and a data node

Credit: Dean Williams

NASA Distributed Active Archive Centers (DAACs)


NASA Earth Science Data: Broader Picture


Observations in AR5

• In AR4, the Earth System Grid played an input role in providing models for climate research

• In AR5, the ESG is being extended as a fully distributed online data system to support access to climate models via the ESG portals

• What is needed, however, is the link to satellite observations and the convergence between the observational and modeling communities

The reliability of projections could be improved if the models were weighted according to some measure of skill. . . Since there is no verification for a climate forecast on timescales of decades to centuries, the skill or performance of the models needs to be defined, for example, by comparing simulated patterns of present day climate to observations.

Scoping of the IPCC 5th Assessment Report, IPCC Working Group, April 2009


Long Term Objective

• Establish a NASA-wide capability for the climate modeling community to support model-to-data intercomparison:
– Ensure observations are available alongside models
– Develop a common approach for sharing observations with the climate research community
– Leverage existing data systems within NASA and ESG
– Ensure that NASA R&A programs have the necessary infrastructure to support model-to-data verification and data analysis
– Provide phased capabilities for AR5 and AR6
• Develop a strong collaboration between observation and modeling communities (both science and technical)
– JPL and PCMDI have a very good working relationship

Challenges with Observational Data

• Massive
– They entail detailed information about processes through multivariate distributions on multiple spatial and temporal scales
• Heterogeneous
– Have variety of organizational structures, retrieval methods, sampling characteristics, and meaning (not like model output!)
• Distributed
– Are stored all over the country and the world with EOSDIS being a principal infrastructure
• Analysis
– Access and computational capabilities are needed to assemble and perform analysis “on-the-fly”

Traditional Paradigm

• User program must encode all functionality beyond gross-level access.
• Requires knowledge of specific instrument characteristics such as retrieval methods, format, measurement error characteristics and biases, etc.
• Difficulties multiply with more than one data source.

Credit: Braverman, Mattmann, Crichton

Emerging Paradigm

• Push as much computation as possible to locations where the data reside; minimize data movement

• Deploy simple services to data centers that provide access and the computational functions to enable model-to-data analysis
– Embrace service-oriented style of architecture

Credit: Braverman, Mattmann, Crichton

Science Data File Formats

• Hierarchical Data Format (HDF)
– http://www.hdfgroup.org
– Versions 4 and 5
– Lots of NASA data is in 4, newer NASA data in 5
– Encapsulates
  • Observation (Scalars, Vectors, Matrices, NxMxZ…)
  • Metadata (Summary info, date/time ranges, spatial ranges)
– Custom readers/writers/APIs in many languages
  • C/C++, Python, Java
– Most NASA observational data is in HDF format

Science Data File Formats

• network Common Data Form (netCDF)
– www.unidata.ucar.edu/software/netcdf/
– Versions 3 and 4
– Heavily used in DOE, NOAA, etc.
– Encapsulates
  • Observation (Scalars, Vectors, Matrices, NxMxZ…)
  • Metadata (Summary info, date/time ranges, spatial ranges)
– Custom readers/writers/APIs in many languages
  • C/C++, Python, Java
– Not a hierarchical representation: all flat
– Most climate model output is in netCDF

Tools to extract data from scientific data formats?

• There are actually quite a few that range from…
– GUIs and higher level (more sophisticated) software
  • R, Matlab, IDL, NCL, etc.
  • Intermediate APIs: NetCDF-Java, NetCDF C API, HDF4/5 API
– Low level, command-line tools
  • UNIX strings command
• One concern: Decimate the binary file format and give you
– Metadata (Start/End date time boundaries, spatial boundaries, abstract, investigator name, mission name, etc.)
– The actual data
• Let’s take an example: Apache Tika: metadata

Apache Tika is…

• A content analysis and detection toolkit
• A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries
• A rich Metadata API for representing different Metadata models
• A command line interface to the underlying Java code
• A GUI interface to the Java code
• http://tika.apache.org

Bootstrapping

• Download Tika from:
– http://tika.apache.org/download.html
• Grab tika-app-0.9.jar
– http://repo1.maven.org/maven2/org/apache/tika/tika-app/0.9/tika-app-0.9.jar
• alias tika "java -jar tika-app-0.9.jar" (csh syntax; in bash use alias tika="java -jar tika-app-0.9.jar")
• tika < somefile.doc > extracted-text.xhtml
• tika -m < somefile.doc > extracted.met
• Works on Windows too (alias only on UNIX)
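The same two invocations can be scripted, which helps once there are thousands of files to process. The following is a minimal Python sketch, assuming `java` is on the PATH and the jar sits in the working directory; the helper names `build_tika_command` and `run_tika` are illustrative and not part of Tika.

```python
import subprocess
from pathlib import Path

TIKA_JAR = "tika-app-0.9.jar"  # jar name from the slide; adjust to your download


def build_tika_command(jar=TIKA_JAR, metadata_only=False):
    """Return the argument list for invoking the Tika command-line app."""
    command = ["java", "-jar", jar]
    if metadata_only:
        command.append("-m")  # -m prints metadata instead of extracted text
    return command


def run_tika(input_path, jar=TIKA_JAR, metadata_only=False):
    """Feed a file to Tika on stdin and return whatever it prints on stdout."""
    with Path(input_path).open("rb") as stream:
        result = subprocess.run(
            build_tika_command(jar, metadata_only),
            stdin=stream,
            stdout=subprocess.PIPE,
            check=True,  # raise if java or Tika exits with an error
        )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `run_tika("somefile.doc", metadata_only=True)` mirrors the `tika -m < somefile.doc` line above, without needing a shell alias.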

A quick NASA dataset

• Atmospheric Infrared Sounder Mission (AIRS)
– Level 2 Cloud Clear Radiance Product
– Grab it from here:
  • ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/
– Just grab the first file
  • java -jar tika-app-0.9.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf
– Hopefully this worked for you, if not, blame Bruce
  • And windows
– And Bill Gates

So you can get info from the file, what to do with it?

• You guys know plenty more about that than me!
• However…
– Let’s take an example where we want to extract a time series of temp. profile information from AIRS level 2 datasets
• …and then, to compare it with model output from the NCAR Community Climate System Model (CCSM)
• Compare meaning compute some statistic, e.g., let’s say averages that we can then compare between measured and predicted values

Some initial parameters

• AIRS Level 2 Standard Products
– HDF4, with HDF-EOS metadata
– Housed in several places
  • AIRS TLSCF (JPL, Pasadena, West Coast), NASA GES DISC (Goddard, Maryland, East Coast)
• NCAR CCSM model output
– NetCDF, with CF metadata
– Housed in several places, canonical source is the Earth System Grid
  • Lawrence Livermore National Laboratory (LLNL), Livermore, CA

What’s the process?


Step 1: AIRS data

• Decide on some set of AIRS data to select
– Time bounds (e.g., January 2007)
– Spatial bounds (lat lon box)
• Understand AIRS data
– 240 files per day, broken down into 6 minute granules
– Each file is in HDF4 format, with measured values for each variable part of the Level 2 std product
– Understand the variable name: TAirStd

Step 1a: Obtain AIRS data

• Some options
– Go to the GES DISC and get the AIRS data from their FTP server – boo!
– Get just the AIRS data you need from a web service (OPeNDAP), i.e., subset it – better!
  • Subset out the TAirStd 45x30 matrix, and only the part of that matrix that you care about that corresponds to your spatial region of interest
  • Requires that you know what variable is used for lat, lon, and time (stored in separate 45x30 matrices)

Step 1b:

• So you’ve got 240 * 31 files = 7440 files
• Each one of these is pretty big (order of gigabytes)
– Let’s assume 2 GB per file
– That would mean you need ~15 TB of space (7440 * 2 GB) just to get your obs data – eeep!
• Better idea:
– Many of those 7440 files aren’t over your region of interest, so discard the ones that aren’t
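The arithmetic and the "discard what you don't need" idea can be sketched in a few lines of Python. This is a sketch under stated assumptions: the granule count, month length and 2 GB file size are the slide's working numbers, while the bounding-box layout `(south, north, west, east)` and the helper names are hypothetical. How you obtain each granule's bounding box (for example from its metadata) is left open.

```python
GRANULES_PER_DAY = 240   # one granule every 6 minutes (24 * 60 / 6)
DAYS = 31                # January
ASSUMED_GB_PER_FILE = 2  # the slide's working assumption


def total_files(granules_per_day=GRANULES_PER_DAY, days=DAYS):
    """Number of granule files for the chosen period."""
    return granules_per_day * days


def storage_tb(n_files, gb_per_file=ASSUMED_GB_PER_FILE):
    """Total volume in terabytes, using 1 TB = 1000 GB."""
    return n_files * gb_per_file / 1000


def overlaps(granule_box, region_box):
    """True if two (south, north, west, east) boxes intersect.

    Simplification: neither box crosses the antimeridian.
    """
    g_south, g_north, g_west, g_east = granule_box
    r_south, r_north, r_west, r_east = region_box
    return (g_south <= r_north and g_north >= r_south
            and g_west <= r_east and g_east >= r_west)


def keep_relevant(granules, region_box):
    """Filter a {filename: bounding_box} mapping down to overlapping granules."""
    return [name for name, box in granules.items() if overlaps(box, region_box)]
```

With these numbers `total_files()` gives 7440 and `storage_tb(7440)` gives 14.88, which is why filtering by region before downloading matters.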

What’s the process?


Step 2

• Given a subset list of those 7440 files (let’s say 1500 or so)
• For each file
– Subset out each TAirStd 45x30 matrix from the file (and believe it or not you may not even need all of those 45 x 30 matrices either), which results in a set of data points X = (v)
– Subset out lat, lon and time and shove them into the corresponding value to yield a 4-tuple
  • X = (v, t, lat, lon)
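As a rough illustration of what "shove them into the corresponding value" means, the Python sketch below pairs each cell of a value matrix with the cells at the same position in the lat, lon and time matrices. It assumes the four matrices have already been read into nested lists by whatever HDF reader you use; the function name and the `fill_value` sentinel are hypothetical, so check the product documentation for the real missing-data marker.

```python
def extract_points(values, times, lats, lons, region_box, fill_value=-9999.0):
    """Flatten co-located 2-D arrays into (v, t, lat, lon) tuples inside a box.

    region_box is (south, north, west, east). Cells equal to fill_value are
    treated as missing and skipped (the sentinel is an assumption).
    """
    south, north, west, east = region_box
    points = []
    for i, row in enumerate(values):
        for j, v in enumerate(row):
            if v == fill_value:
                continue
            lat, lon = lats[i][j], lons[i][j]
            # Keep only the cells that fall inside the region of interest.
            if south <= lat <= north and west <= lon <= east:
                points.append((v, times[i][j], lat, lon))
    return points
```

For a real granule each input would be a 45x30 matrix; the function does not depend on that shape.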

Step 2a

• Hidden assumption
– Step 2 is easy
– IT’S NOT
• In fact, Step 2 is usually one of the hardest parts since not all of these NASA or NOAA datasets include a subset function
• The datasets themselves may have different temporal properties (compared to models)
– AIRS data relevant only at 1:30am and 1:30pm
• Different spatial properties too: 500m level

Sample GHRSST L2 Data Set Image

Notice that the lines of longitude and latitude are not perfectly straight. This makes it more difficult to locate equator crossings.

What’s the process?


Step 3

• Given a set of data point tuples X = (v, t, lat, lon)
– Build up a cube of the form lon by lat by time
– “Regrid” the resultant satellite data onto this cube
– Make this cube match up to the gridding properties of your model
  • Maybe 1 deg by 1 deg grid box over the area that you care about
  • Maybe daily, monthly, hourly: your model will dictate this!
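One simple way to build such a cube is bin averaging: drop each point into the grid cell and time step it falls in, then average per cell. The Python sketch below does exactly that and nothing more. It is an assumed, simplified stand-in for whatever regridding a real tool performs (no interpolation, no footprint weighting), and it assumes `t` is expressed in seconds since some epoch.

```python
from collections import defaultdict


def regrid(points, region_box, cell_deg=1.0, seconds_per_step=86400):
    """Average (v, t, lat, lon) points into a {(k, j, i): mean} cube.

    k indexes time steps, j latitude cells, i longitude cells, counted from
    the south-west corner of region_box = (south, north, west, east).
    Empty cells are simply absent from the result.
    """
    south, north, west, east = region_box
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t, lat, lon in points:
        if not (south <= lat < north and west <= lon < east):
            continue  # point lies outside the target grid
        key = (
            int(t // seconds_per_step),
            int((lat - south) // cell_deg),
            int((lon - west) // cell_deg),
        )
        sums[key] += v
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

The defaults correspond to the slide's example of a 1 deg by 1 deg grid with daily time steps; change `cell_deg` and `seconds_per_step` to match your model.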

Step 3a

• Given a satellite data “regridded cube”, it’s fairly trivial to compute stats on that cube that match up to the model
– Averages/time – sum lat/lon 2d sheet for each sheet over time (the z axis in the cube)
– Means/time – derive mean for lat/lon 2d sheet over time (the z axis in the cube)
– Etc.
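To make the "sheets over time" picture concrete, here is a small Python sketch that treats the cube as a nested list `cube[t][lat][lon]` with `None` marking empty cells. That layout and the function names are assumptions for illustration. It computes one spatial mean per time step, and a time-mean map that collapses the time axis.

```python
def sheet_mean(sheet):
    """Mean of one lat/lon sheet, ignoring missing (None) cells."""
    cells = [v for row in sheet for v in row if v is not None]
    return sum(cells) / len(cells) if cells else None


def mean_per_time_step(cube):
    """cube[t][lat][lon] -> list with one spatial mean per time step."""
    return [sheet_mean(sheet) for sheet in cube]


def time_mean_map(cube):
    """Collapse the time axis: mean at each lat/lon cell over all time steps."""
    n_lat, n_lon = len(cube[0]), len(cube[0][0])
    result = []
    for j in range(n_lat):
        row = []
        for i in range(n_lon):
            series = [sheet[j][i] for sheet in cube if sheet[j][i] is not None]
            row.append(sum(series) / len(series) if series else None)
        result.append(row)
    return result
```

The spatial means here are unweighted; on a regular lat/lon grid that spans many latitudes an area weighting is usually preferable.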

OK the schedule says I’ll talk about a tool


RCMED

[Figure: architecture diagram. Labels: TRMM, ERA-Int, MODIS, CRU, AIRS, Extractors, RCMED observation database, www, RCMET evaluation tool front-end, Model file, client-side (user’s local machine), server-side (hosted at JPL)]

Datasets included

• MODIS (satellite cloud fraction): [daily 2000 – 2010]
• TRMM (satellite precipitation): 3B42 [daily 1998 – 2010]
• AIRS (satellite surface + profile retrievals) [daily 2002 – 2010]
• ERA-Interim (reanalysis): [daily 1989 – 2010]
• NCEP Unified Rain gauge Database (gridded precipitation): [daily 1948 – 2010]
• CRU TS 3.0: precipitation, Tavg, Tmax, Tmin [monthly 1901 – 2006]

Level 3: T(2m), T(p), z(p)

T(2m), Td(2m), T(p), z(p)

How do RCMET and RCMED talk?

[Figure: architecture diagram. Labels: TRMM, ERA-Int, MODIS, CRU, AIRS, Extractors, RCMED observation database, www, RCMET evaluation toolkit, Model file, client-side (user’s local machine), server-side (hosted at JPL)]

Programmatic Access

The RCMED API:

- Search the entire database
- Space/Time box
- Simple RESTful URL
- Simple ASCII result format
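A space/time-box query over a RESTful URL with an ASCII reply might look like the Python sketch below. Everything specific in it is hypothetical: the endpoint `http://example.org/rcmed/query`, the parameter names, and the comma-separated `lat,lon,time,value` result layout are placeholders chosen for illustration, not the actual RCMED API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "http://example.org/rcmed/query"


def build_query_url(dataset, lat_min, lat_max, lon_min, lon_max, start, end):
    """Encode a space/time bounding box as URL query parameters."""
    params = {
        "dataset": dataset,
        "latMin": lat_min,
        "latMax": lat_max,
        "lonMin": lon_min,
        "lonMax": lon_max,
        "timeStart": start,
        "timeEnd": end,
    }
    return BASE_URL + "?" + urlencode(params)


def parse_ascii(text):
    """Parse lines of 'lat,lon,time,value' into tuples (assumed layout)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        lat, lon, time, value = line.split(",")
        rows.append((float(lat), float(lon), time, float(value)))
    return rows
```

The appeal of this style is that any script able to fetch a URL and split text lines can act as a client, with no format-specific library required.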

Recall: this would be what you need for step 2.5


RCMED Web-Based Access

The RCMED Data Portal:

- Database Statistics
- Project information
- Advanced search options
- Data product download
- Query API for 3rd Party Scripts

RCMET

[Figure: architecture diagram. Labels: TRMM, ERA-Int, MODIS, CRU, AIRS, Extractors, RCMED observation database, www, RCMET evaluation tool front-end, Model file, client-side (user’s local machine), server-side (hosted at JPL)]

[Figure: RCMET processing workflow, with inputs from the model file and the RCMED observation database. The figure also carries an “optional” label.]

• Collect user choices (GUI / command line)
• Load model data
• Retrieve obs from database
• Spatial re-gridding onto common grid
• Time averaging (e.g. calculate monthly means from daily data)
• Area-averaging (e.g. calculate area-weighted mean over user defined masked region)
• Annual cycle compositing (e.g. calculate means of all Januarys, all Februarys, etc.)
• Metric calculation (e.g. calculate bias, RMS error, etc.)
• Plot production (e.g. map, time series plot, Taylor diagram)
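The three averaging steps in this workflow are simple enough to sketch in plain Python. The sketch below is illustrative only and is not RCMET code. The function names, the date-string keys, and the cosine-of-latitude weighting (a common choice for regular lat/lon grids) are assumptions.

```python
import math
from collections import defaultdict
from datetime import date


def monthly_means(daily):
    """daily: {"YYYY-MM-DD": value} -> {(year, month): mean of that month}."""
    buckets = defaultdict(list)
    for day_text, value in daily.items():
        day = date.fromisoformat(day_text)
        buckets[(day.year, day.month)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def annual_cycle(monthly):
    """monthly: {(year, month): value} -> {month: mean over all years}."""
    buckets = defaultdict(list)
    for (_, month), value in monthly.items():
        buckets[month].append(value)
    return {m: sum(vals) / len(vals) for m, vals in sorted(buckets.items())}


def area_weighted_mean(values, lats):
    """Mean of grid-cell values weighted by cos(latitude in degrees).

    Assumes a regular lat/lon grid, where cell area shrinks toward the poles.
    """
    weights = [math.cos(math.radians(lat)) for lat in lats]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

A user-defined mask would fit in naturally by dropping the masked cells from `values` and `lats` before calling `area_weighted_mean`.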

What we’re working on

• Annual cycle compositing
• Area-averaging:
  • Full domain
  • User defined lat/lon bounding box
  • User supplied mask in netCDF file
• Metrics:
  • Mean error (bias), RMS error, Mean Absolute Error, Pattern Correlation, Anomaly Correlation, Probability Distribution Function
• Plots:
  • Time series
  • Map plots
  • Taylor Diagram
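For reference, four of the listed metrics can be written down directly. The Python sketch below uses their standard textbook definitions on flattened, equal-length model and observation sequences; it is not taken from RCMET, and it is unweighted, so any area weighting would have to be added.

```python
import math


def bias(model, obs):
    """Mean error: average of (model - obs)."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


def rmse(model, obs):
    """Root-mean-square error."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def mean_absolute_error(model, obs):
    """Average of |model - obs|."""
    return sum(abs(m - o) for m, o in zip(model, obs)) / len(obs)


def pattern_correlation(model, obs):
    """Pearson correlation between two flattened fields (unweighted)."""
    n = len(obs)
    m_mean, o_mean = sum(model) / n, sum(obs) / n
    cov = sum((m - m_mean) * (o - o_mean) for m, o in zip(model, obs))
    m_var = sum((m - m_mean) ** 2 for m in model)
    o_var = sum((o - o_mean) ** 2 for o in obs)
    return cov / math.sqrt(m_var * o_var)
```

Bias keeps the sign of the error, so positive and negative errors can cancel; RMS error and mean absolute error do not cancel, which is why they are usually reported together.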

Demo

• If this doesn’t work I have backup slides
– Cross your fingers
– And if it doesn’t work, I blame Bruce, Chris, Richard, Bill, Hassan et al. for keeping me out last night

Lessons Learned

• Separating out RCMED and RCMET
– = GOOD
– Allows for each to evolve independently
• Keep adding satellite observations, analysis tool just reaps the benefits without having to know or care about formats, temporal differences, spatial differences, etc.
• RCMET installation on client machine
– …ehhh, not always so good
– RCMET has a tightly coupled dependency on RCMED

Thoughts

• Bandwidth limited in Africa
• Option 1: Couple RCMED and RCMET-like system closely together
– Stand up RCMES (coupled system)
– Easily add new datasets, new plots, new stats, etc.
– Bandwidth limitation more easily dealt with due to closeness
• Option 2: Provision RCMES as a web-ui near a data center with lots of bandwidth
– Allows for true “thinlet” apps, either browser or phone

Alright, I’ll shut up now

• Any questions?
• THANK YOU!
– mattmann@apache.org
– chris.a.mattmann@nasa.gov
– @chrismattmann on Twitter

Acknowledgements

• CORDEX Team
– For inviting us out here, thank you!
• NASA Jet Propulsion Laboratory
– RCMES Team
– CDX Team
– OODT Team
• Andrew Hart, Peter Lean, Cameron Goodale, Jinwon Kim, Dan Crichton, Duane Waliser, Amy Braverman
