A. M. Ellison et al. - 1
AN ANALYTIC WEB TO SUPPORT THE ANALYSIS AND SYNTHESIS OF ECOLOGICAL DATA
Aaron M. Ellison,1,3 Leon J. Osterweil,2 Julian L. Hadley,1 Alexander Wise,2 Emery Boose,1
Lori Clarke,2 David R. Foster,1 Allen Hanson,2 David Jensen,2
Paul Kuzeja,1 Edward Riseman,2 and Howard Schultz2
1Harvard University, Harvard Forest, P.O. Box 68, Petersham, Massachusetts 01366, USA
2University of Massachusetts, Department of Computer Science, Computer Science Building,
Amherst, Massachusetts 01003, USA
3 Author for correspondence: Aaron M. [email protected]; phone: 978-724-3302; fax: 978-724-3595
For submission to: Ecology
Type of submission: Concepts & Syntheses
Number of words: 10,050
Number of tables: 2
Number of figures: 7
Date of submission: 8 September 2004
Abstract. Synthesis of ecological data is a major vehicle for addressing questions that span a
wide range of spatial and temporal scales. Useful syntheses should be reliable, reproducible, and
repeatable. Such synthesis requires access to data, accurate and precise descriptions of those data
(metadata), and documentation of the analytical processes used to derive usable datasets from
collections of raw data that may be massive and may include real-time data streams. We describe
a formal representation – an analytic web – that provides clear, complete, and precise definitions
of scientific processes used to process raw and derived data. The formalisms used to define
analytic webs are adaptations of those used in software development, and they provide a novel
and effective support system for the synthesis of ecological data. We illustrate the utility of an
analytic web through the analysis and synthesis of long-term measurements of whole-ecosystem
carbon exchange, and in the construction of a complete, internet-accessible audit trail of the
analytic processes used to determine the effect of varying model parameters on estimation of
nighttime carbon flux. The analytic web not only increased the ease and speed of processing and
synthesizing ecological data but also provided a great leap forward in communicating the results
of our efforts. Our software toolset, SciWalker, can be used to create a variety of analytic
webs, including webs of webs, tailored to specific data synthesis activities. The process metadata
created by SciWalker is readily adapted for inclusion in Ecological Metadata Language
(EML) files.
Key words: Analytic web; eddy covariance; EML; metadata; synthesis; SciWalker; XML
INTRODUCTION
Examining complex questions, integrating information from a variety of disciplines, and
testing hypotheses at multiple spatial and temporal scales account for an increasing proportion of
ecological and environmental research (e.g., Michener et al. 2001, Andelman et al. 2004). Such
syntheses can help to identify and address the “big” ecological questions (Lubchenco et al. 1991,
Belovsky et al. 2004) and contribute to the setting of local, regional, national, and global
environmental policies (Schemske et al. 1994, IPCC 2001, Kareiva 2002). Synthesis is the
raison d’etre of the National Center for Ecological Analysis and Synthesis (NCEAS); barely a
month passes without the appearance of one or more publications by NCEAS working groups.
At its heart, synthesis involves intellectual creativity – asking crosscutting questions and
confronting existing paradigms from new standpoints. But without interpretable and well-
documented data (Michener et al. 1997, Jones et al. 2001), meaningful synthesis cannot proceed
(Andelman et al. 2004). Fortunately, efforts are underway to develop data documentation
(metadata) tools (Michener 2000, Helly et al. 2002) that are intended to increase the chances that
publicly available datasets are used appropriately.
These are pressing issues. Funding agencies increasingly mandate that datasets obtained
with public funds be made available via the Internet with few restrictions. In general, it is up to
individual investigators to meet federal laws and the directives of funding agencies. The
Ecological Society of America makes allowances for researchers to archive data associated with
papers published in its journals1, but has no codified policy on permanent data accessibility or
1 Ecological Archives: http://esapubs.org/Archive/
archiving. In contrast, the Long-Term Ecological Research (LTER) network insists that the vast
majority of data collected at LTER sites be made available within three years of collection.2
Ecologists not only create synthetic datasets from existing datasets, but also increasingly
rely on real-time or near real-time data streams from automated sources. Examples include
climatic data, satellite imagery, measurements of energy, nutrient or gas fluxes, gene sequences,
and data expected from the nascent National Ecological Observatory Network (NEON). Both
synthesized data compilations and real-time data streams generate enormous datasets (Table 1)
that present novel challenges for visualization, interpretation, and analysis (see, for example,
Wegman 1995, Buja et al. 1996, Billard and Diday 2003). These challenges are posed not so
much by storage needs – the current generation of desktop computers can store gigabytes or
terabytes of data – as by difficulties arising from the software domain. The most obvious
challenges are those imposed by the costs of processing, visualizing, transporting, and analyzing
the data, many of which scale polynomially or exponentially with dataset size (Wegman 1995).
A less obvious but perhaps more important challenge is to assure that the analytical
processing of the data is sufficiently well documented that subsequent users of processed data:
(1) do not misuse or misinterpret the data; (2) can repeat the analyses with the same or
alternative datasets; or (3) can knowingly apply variants of the processes. Increasingly, data are
readily available to users worldwide, and the risk that such data or their analyses will be misused
is growing rapidly. It is a potential boon to science that collaborations among geographically
distributed researchers can be facilitated by easy access to scientific datasets, and this boon
creates unprecedented opportunities for participation in the active conduct of science by large
2 http://www.lternet.edu/data/netpolicy.html
groups of individuals and communities (e.g., the Globus3 and GriPhyN4 projects). Indeed,
distributed data processing among hundreds or thousands of computers linked by the Internet has
significantly accelerated analysis of very large datasets (Foster and Kesselman 1999, Reinefeld
and Schintke 2002, Aloisio and Cafaro 2003, Krishnan 2004, Li et al. 2004), leading to faster
and better scientific investigations.
Such expedited access to data creates opportunities for comprehensive analysis and
synthesis. However, it also increases uncertainties in the analysis (Regan et al. 2002) that
highlight three main concerns that we are addressing with the work described here. These
concerns are (1) the reliability of scientific datasets and their associated results; (2) the
reproducibility of scientific results; and (3) the repeatability of these results. While we focus here
on analysis of and documentation of processes applied to large datasets, we emphasize that the
techniques we describe apply to datasets of any size.
WHAT, WE WORRY?
It is important to recognize that the datasets on which scientific publications are based
often are far removed from the raw data collected in the laboratory or in the field. In general,
these datasets are derived from a sequence of scientific processes, including: sampling; data
checking and cleaning (quality assurance/quality control, or QA/QC); variable transformations;
statistical model construction; and statistical inference and evaluation (Gotelli and Ellison 2004).
Thus, our first concern is that highly processed datasets and the results derived from them may
3 http://www.globus.org
4 http://www.griphyn.org
be unreliable because incorrect or misleading conclusions were drawn from mishandled or
misinterpreted datasets.
Even when metadata are available, complete details of the actual processes, including
software used in the analyses and specifics of the analytical routines and algorithms used,
generally are not available or are poorly documented. Scientists often carry out complex sets of
processes in creating usable or publishable data. But if other scientists cannot determine
definitively the ways in which these processes were executed, there is a strong possibility that
subsequent use of the data will result in uncertain, incorrect, or misleading conclusions (Esser et
al. 2000). The problem of dealing with inappropriate use of data and scientific processes is
challenging even within a team of collaborating scientists in a single laboratory. As wide access
to scientific data increases, the risk of incorrect or misleading conclusions drawn from them will
increase. For example, historical maps are routinely digitized (e.g., Hall et al. 2002), and the
resulting computer-processed (GIS) maps may look so accurate and believable that they are used
uncritically despite the existence of metadata that include accuracy assessments and disclaimers.
Similarly, the uncertainty inherent in ecological forecasting, including predictions of climate
change and resource depletion, allows for a wide range of mutually incompatible conclusions
(IPCC 2001, Lomborg 2001, Bailey 2002), whereas policy decisions depend upon identifying the
correct conclusion.
Secondly, we are concerned that it will become increasingly difficult to reproduce results
if the processes leading to these results are unavailable or poorly documented. Reproducibility is
a fundamental requirement of all science; phenomena observed by one scientist must also be
observable by others. Reproducing results is greatly complicated when the phenomenon of
interest is manifest only through datasets that have been combined and processed through
complex sequences of computer-based tools and procedures. Available datasets must be
accompanied by metadata (Michener et al. 1997) sufficient to allow the data to be distinguished
from other datasets with which they might be confused and to permit data files to be correctly
read and interpreted (Andelman et al. 2004). For example, sediment cores are used routinely to
reconstruct past climates. Some lakes, such as Elk Lake, Minnesota, have been cored repeatedly
by different teams of investigators (Bradbury and Dean 1993, Smith et al. 1997), resulting in a
comprehensive dataset that can be profitably synthesized because comprehensive, standardized
metadata are available.5
The final concern is that the results and analysis may not be repeatable, even by the
original group of scientists that collected and analyzed the data. Repeatability requires that the
process initially used to generate published results, including the tools and algorithms employed
by that process, must be available or well documented. Modifications to any of these processes
may be inadvertent, as when a software package is updated or the underlying operating systems
are modified. Scientists unaware of these modifications may proceed with synthesis under the
incorrect assumption that the original process is being used. If changes have been made, then
the original scientific process may not be repeated, leading either to different results or to the
false conclusion that confirmation of prior results has occurred.
Our view is that all three concerns – reliability, reproducibility, and repeatability – can be
significantly addressed through the use of clear, comprehensive, and accurate documentation of
scientific processes, using formal process definition technologies. Such documentation will
ensure that the data and processes are usable by scientists in ways that facilitate the reliability,
reproducibility, and repeatability of scientific results. Our proposed solution builds upon
5 World Data Center for Palaeoclimatology: http://www.ngdc.noaa.gov/paleo/paleo.html
advances in electronic communication and computer science technology to ease and clarify,
rather than impede and complicate, scientific discovery.
We refer to our solution as an analytic web. An analytic web consists of collections of
representations (in our case, annotated graphs) describing the scientific processes employed, the
datasets used and derived, and the relationships among them. These representations may include
links to physical objects (e.g., URLs to reference data or to other analytic webs) and, thus,
describe graphs of graphs or “webs” of data and process descriptions. To help scientists create,
use, and archive analytic webs, we are developing a toolset, called SciWalker, to support the
definition, execution, evaluation, and reuse of analytic webs. Before describing an analytic web
in detail, we illustrate the need for it with a motivating example.
AN EXAMPLE ILLUSTRATES THE NEED FOR AN ANALYTIC WEB
The increasing atmospheric concentration of carbon dioxide (CO2) and its relationship to
global temperature are well known (IPCC 2001). General circulation models (Cramer et al. 2004,
Meehl et al. 2004) and direct measurements of carbon exchange between ecosystems and the
atmosphere (Barford et al. 2001) provide data that are used to: estimate whether different
ecosystems are sources or sinks for CO2; predict how changes to these ecosystems will alter
atmospheric CO2 levels; and forecast how these estimates and predictions will change under
different climate change scenarios (Wigley and Raper 1992, IPCC 2001). Both direct
measurement of carbon exchange and synthetic modeling of whole ecosystems are relatively
new, complex tasks that require large amounts of data from diverse sources. These datasets are
processed with a range of statistical and mathematical models, and the models can provide
continuous estimates of CO2 exchange rate in near real-time. Data posted to the World Wide
Web following publication are highly processed,6 and the reproducibility and repeatability of
these processes contribute to the reliability of these estimates, predictions, and forecasts.
Measurements of ecosystem CO2 flux
Long-term measurements of carbon exchange of whole ecosystems began in the mid-
1980s to early 1990s and primarily have used the eddy covariance method (Baldocchi et al.
1988, Wofsy et al. 1993, Hollinger et al. 1994, Barford et al. 2001). Briefly, this method
estimates CO2 flux from the covariance of CO2 concentration and vertical wind velocity
(Baldocchi et al. 1988). The estimates are accurate if three conditions are satisfied:
1. The concentration of CO2 in air moving horizontally across the surface must be relatively
constant in time. Any change in its concentration due to horizontal air movement must be
adjusted for.
2. All vertical air motions must be accounted for. Since air can move in very small vertical
eddies, which complete a circular motion in a few seconds, vertical velocity and the
concentration of CO2 must be measured at least several times per second.
3. Air parcels near the surface must reach the measurement point by vertical motion
(convection) more rapidly than they leave the area by horizontal motion (advection).
Barriers between the ground surface and the measurement point may exist, for example a
forest canopy between the ground and a measurement point above the canopy. These
barriers must be overcome by a combination of sufficient turbulence between the
6 e.g., Worldwide Fluxnet network: http://daac.ornl.gov/FLUXNET/
measurement point and the surface, and sufficient buoyancy in air parcels near the
surface.
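The core computation behind the eddy covariance method can be sketched in a few lines: the flux estimate over one averaging interval is the covariance of vertical wind velocity and CO2 mixing ratio. This is a minimal illustration of that single step, not the full processing chain, which also detrends against running means and applies corrections omitted here; the function names are ours.

```python
# Minimal sketch of the eddy covariance flux estimate: CO2 flux is taken as
# the covariance of vertical wind velocity (w) and CO2 mixing ratio (c),
# both sampled at high frequency (e.g., 5 Hz). Detrending and coordinate
# corrections used in real processing are omitted.

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def eddy_flux(w, c):
    """Estimate CO2 flux over one averaging interval from w and c samples."""
    return covariance(w, c)
```

When upward-moving air parcels carry systematically higher CO2 concentrations, the covariance (and hence the estimated flux) is positive; uncorrelated fluctuations yield a flux near zero.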
During daylight hours, forest canopies rarely present a serious barrier to accurate eddy
covariance measurements, because solar heating of the ground surface heats the air near the
surface, creating both buoyancy and rising air, or vertical convection. At night, however,
convection is much weaker and the ground may become as cold as or colder than the air above it,
creating a layer of stable or even sinking air above the surface. Due to these problems, accurate
eddy covariance measurements may be unavailable for 40% to 75% of nighttime hours,
depending on the location (Wofsy et al. 1993, Goulden et al. 1996, Barford et al. 2001, Saleska
et al. 2003, Hollinger et al. 2004). Eddy covariance data also may be unreliable during daytime
periods with weak turbulence and low buoyancy of air near the ground – these are particularly
common near dawn and dusk. Lastly, data may be lost during equipment calibrations,
maintenance or malfunctions. Processes for filling in (interpolating) these data gaps must
produce unbiased estimates that are as accurate as possible. In publications with major policy
implications (Wofsy et al. 1993, Goulden et al. 1996, Barford et al. 2001, Saleska et al. 2003),
highly processed data, including the interpolated data, are often reduced to a single graph and
only the processed data are publicly available.7
Currently, geographically and temporally extensive eddy covariance data are being
synthesized to assess the impact of many different physiographic and climatic variables on
carbon storage and carbon flux. These large-scale estimates of carbon storage and carbon flux
are used by decision-makers to set policies regarding emissions, carbon sequestration, and
carbon credits and offsets (Brown 2002, Saunders et al. 2002, Bettelheim and D'Origny 2002).
7 For example, see http://www-as.harvard.edu/data/nigec-data.html
However, myriad investigators collect these large datasets, and gap-filling processes differ
among investigators. Consistent processes for data acquisition and manipulation must be used if
aggregations or syntheses of the data are to identify real effects and make meaningful
predictions. Differences among datasets resulting from variation in data collection and
manipulation processes can impede reliable forecasts, especially if these differences are
unknown or unaccounted for. For example, eddy covariance measurements in the Amazon basin
in the 1990s were used to conclude that Amazon forests were storing 1 to 5 Mg C ha⁻¹ yr⁻¹
(Grace et al. 1995, Grace et al. 1996, Malhi et al. 1998, Carswell et al. 2002). More recently
Saleska et al. (2003) found an annual net carbon loss of between 0 and 2 Mg C ha⁻¹ yr⁻¹ from
two Amazonian forests and concluded that earlier estimates of substantial C storage may have
been caused by inclusion of data from nighttime periods when low turbulence was reducing the
measured carbon efflux. The result may have been an underestimate of nighttime C release and
an overestimate of annual C storage. If an accurate and easily understandable description of the
processes that were actually used had been available, the source of this discrepancy would have
been easier to determine, and the discrepancy itself might even have been avoided.
A prose description of the data-acquisition process
We measure east-west (u), north-south (v), and vertical (w) wind velocity components at
5 Hz with a sonic anemometer (Campbell Scientific, Logan, Utah) mounted on a tower extending
approximately 5 m above the tree canopy (Hadley and Schedlbauer 2002). Water vapor (q) and
CO2 (c) mixing ratios at the location of the sonic anemometer are simultaneously measured at 5
Hz by a closed path infrared gas analyzer (Licor Inc., Lincoln, Nebraska). A laptop computer
collects both data streams and computes ten-minute running means of each of these six tower
variables. Every 30 minutes, the computer calculates mean values of u, v, w, q, and c and the
covariance of u, v, q, and c, with deviations from the running mean of w. A data logger at the
same site measures environmental characteristics that have significant effects, including
photosynthetically active radiation (PAR), air temperature, soil temperature, atmospheric
humidity and soil moisture. These environmental variables are measured every 30 seconds and
averaged every 30 minutes.
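The aggregation of the environmental variables can be sketched as block averaging: sixty 30-second readings make one 30-minute mean. The sampling counts below follow the text; the function name is ours.

```python
READINGS_PER_HALF_HOUR = 60  # environmental variables are logged every 30 s

def half_hour_means(readings):
    """Average consecutive blocks of sixty 30-second readings into
    30-minute means. Any incomplete trailing block is dropped."""
    n_blocks = len(readings) // READINGS_PER_HALF_HOUR
    return [
        sum(readings[i * READINGS_PER_HALF_HOUR:(i + 1) * READINGS_PER_HALF_HOUR])
        / READINGS_PER_HALF_HOUR
        for i in range(n_blocks)
    ]
```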
Before the flux of CO2 can be estimated, the 30-minute means and covariances of the
tower variables are checked by the investigator to determine which observations are associated
with atmospheric conditions under which measured CO2 flux is not limited by turbulence or
weak vertical air movement. This is most easily done for data collected at night when turbulence
is low and the probability of inadequate vertical air motion is greatest; for illustrative purposes,
we describe the process only for nighttime CO2 flux data.
First, we discard observed values of CO2 flux if the wind direction is unsuitable for flux
measurements. A particular wind direction may be unsuitable because local topography in a
given direction creates unpredictable turbulence patterns or because the forest of interest does
not occur over a sufficient fetch in that direction.
Second, we examine the relationship between friction velocity, u* (a measure of
turbulence [in m/s], which equals the square root of vertical momentum flux) and CO2 flux for
several weeks of nighttime measurements. Flux is plotted against u* (Fig. 1), and a threshold
value of u* (u*threshold) is identified beyond which CO2 flux does not increase significantly.
Observed values of CO2 flux are discarded if u* < u*threshold. If data from all wind directions are
suitable, the u*threshold criterion typically results in the discarding of <50% of the nighttime
observations of CO2 flux. On the other hand, if some wind directions are unsuitable, >75% of the
nighttime observations may be rejected.
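The two screening steps above can be combined into a single per-observation filter. The threshold and wind sectors below are purely illustrative assumptions: in practice u*threshold is read off a plot like Fig. 1, and the acceptable sectors depend on site topography and fetch.

```python
U_STAR_THRESHOLD = 0.2  # m/s; illustrative value, derived in practice from Fig. 1
GOOD_WIND_SECTORS = [(0.0, 90.0), (180.0, 360.0)]  # degrees; hypothetical sectors

def keep_flux_observation(wind_direction, u_star):
    """Retain a nighttime CO2 flux observation only if the wind direction is
    suitable and friction velocity (u*) meets the turbulence threshold."""
    suitable_direction = any(lo <= wind_direction < hi for lo, hi in GOOD_WIND_SECTORS)
    return suitable_direction and u_star >= U_STAR_THRESHOLD
```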
Next, we fill the gaps in the dataset that result from discarding observed values of CO2
flux by estimating the values that would have been observed if u* ≥ u*threshold. To fill these gaps,
we fit regression models of the reliable observations (CO2 flux | u* ≥ u*threshold) to the measured
environmental variables. For nighttime observations, the predictor variables (identified using
stepwise multiple regression) are soil temperature, air temperature and occasionally soil
moisture, which can have large effects on soil respiration when soil temperatures are relatively
warm (spring through fall; see Savage and Davidson 2001).
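The gap-filling step amounts to fitting a regression on the retained observations and predicting the discarded ones. For brevity, this sketch uses a single predictor (soil temperature) and ordinary least squares rather than the stepwise multiple regression described above; names are ours.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fill_gaps(flux, soil_temp):
    """flux: CO2 flux series with None where observations were discarded
    (u* < u*threshold); soil_temp: co-measured predictor series.
    Fits on retained observations, then predicts the gaps."""
    kept = [(t, f) for f, t in zip(flux, soil_temp) if f is not None]
    a, b = fit_line([t for t, _ in kept], [f for _, f in kept])
    return [f if f is not None else a + b * t for f, t in zip(flux, soil_temp)]
```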
AN ANALYTIC WEB DEFINED
Although this prose description summarizes the scientific processes used to process the
raw data, such prose descriptions lack the precision and completeness needed to ensure
reliability, reproducibility, and repeatability. These goals are far more achievable through
formalisms developed for engineering of computer software. These formalisms can be used to
create analytic webs: clear, comprehensive, and precise definitions of the scientific processes
used to process raw and derived data. More generally, an analytic web is a coordinated set of
formal representations of key aspects of a scientific process, which together provide a clear,
precise, and comprehensive definition of the process. We suggest that such analytic webs would
be effective in supporting the synthesis of ecological data.
AREN’T THE METADATA SUFFICIENT?
Metadata comprise “the higher level information or instructions that describe the content,
context, quality, structure, and accessibility of a specific data set” (Michener et al. 1997). In
recent years there have been remarkable advances in the development of standards and tools for
creating and using ecological metadata (Andelman et al. 2004). For ecologists, perhaps the most
important of these has been the creation of a standard for structured metadata, Ecological
Metadata Language (EML). EML is a modular and flexible application of Extensible Markup
Language (XML)8 developed by the ecological community and designed to support data
discovery, access, integration, and synthesis.9 A sample EML file for the eddy flux experiment
described above can be viewed on the Harvard Forest web page.10
EML facilitates data discovery by encoding general information about datasets, such as
creator, title, abstract, keywords, and geographic, temporal, and taxonomic coverage. Access to
data is facilitated through specific information on the location (e.g., its uniform resource locator
[URL] or digital object identifier [DOI]) and the physical characteristics (e.g., file name, coding,
record delimiters, and field delimiters) of data files. Integration and synthesis are facilitated
through detailed information on dataset content (e.g., name, unit, precision, range, and number
type for each variable in a table). Retrieval of specific items of information is facilitated by the
structured nature of EML files, which are interpretable and searchable (parsable) by computers.
EML provides two modules for the documentation of methods used in the creation of a
dataset. The protocol module is used to describe an established field, laboratory, or analytical
8 http://www.w3.org/XML/
9 http://knb.ecoinformatics.org/software/eml
10 http://harvardforest.fas.harvard.edu, dataset HF072
procedure – essentially a standardized method. The methods module is used to describe the field,
laboratory, and analytical procedures that were actually used in the creation of a particular
dataset, and may refer to one or more relevant protocol modules, if available. The protocol
module is prescriptive, while the methods module is descriptive. Both modules have provisions
for specifying individual steps and associated citations, instrumentation, software, and quality
control, but are otherwise essentially narrative in content. The prose description above of the
process used to clean, model, and gap-fill the eddy covariance data and estimate CO2 flux is
suitable for inclusion in the methods module of an EML file. Individual steps in a process can be
expressed in EML through objects called methodStep elements.
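As a sketch of what such sequential-narrative process metadata looks like, the fragment below assembles a methods element with one methodStep per processing step. The element names follow EML conventions, but this is not a complete or valid EML document: required structure and namespace declarations are omitted, and the helper function name is ours.

```python
import xml.etree.ElementTree as ET

def methods_fragment(step_descriptions):
    """Build a minimal EML-style <methods> fragment, one methodStep per step."""
    methods = ET.Element("methods")
    for text in step_descriptions:
        step = ET.SubElement(methods, "methodStep")
        description = ET.SubElement(step, "description")
        para = ET.SubElement(description, "para")
        para.text = text
    return ET.tostring(methods, encoding="unicode")

# The nighttime flux processing steps described earlier, as narrative steps:
fragment = methods_fragment([
    "Discard observations from unsuitable wind directions.",
    "Discard observations with friction velocity below u*threshold.",
    "Gap-fill discarded values by regression on environmental variables.",
])
```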
The current version of EML (2.0.0) thus is able to capture process metadata in narrative
or sequential-narrative form. However, this may not be sufficient to allow the processed data to
be reproduced from the original data if the analytical steps involved are complex or if they use
structures such as conditionals or iteration. If not precisely stated, such complexity can create
ambiguity. In addition, some combinations of data processing techniques may lead to statistical
or logical errors. For example, further smoothing of data that have already been smoothed may
suppress important variation. Rounding errors and transcription errors may also result in the
publication of incorrect or misleading results (Garcia-Berthou and Alcaraz 2004).
We propose that every dataset generated by a research project should have attached to it
process metadata that formally describes the processes by which the data were derived, including
the sequence of tools, techniques, and intermediate datasets used. Such metadata would be a
useful complement to existing metadata standards such as EML, and in fact could be
incorporated directly into EML files. Our research has convinced us that such an approach could
have substantial value in helping scientists understand and share datasets more safely and
effectively. While many scientists currently describe their analytical processes only informally
and partially, there is growing interest in using formal notations, such as dataflow and workflow
graphs, to document these processes (e.g., Ailamaki et al. 1998, Altintas et al. 2004a, 2004b).
Our research expands on these efforts in an attempt to use more expressive formalisms to capture
the subtleties and complexities of, and to remove the potential ambiguities in, the descriptions of
the processes used by ecologists and other scientists.
FORMAL DESCRIPTION OF AN ANALYTIC WEB
An analytic web is a coordinated collection of three types of graphs – a dataflow graph, a
dataset derivation graph, and a process definition graph – all of which were originally developed
for use in defining and controlling software development projects (e.g., Ghezzi et al. 2003). We
propose to use these three graphs to formalize different aspects of ecological data processing in
such a way as to assure its reliability, reproducibility, and repeatability.
A dataflow graph (DFG) clearly defines type information: which types of datasets are
acted upon by which types of processes (tools, activities) in order to produce other types of
datasets. A DFG documents the relationships among dataset types and process types. For
example, a process type of “interpolate via linear regression” could be applied to a dataset type
of “rainfall data” and the DFG would indicate the conditions under which this interpolation
would be applied. Like a recipe in a cookbook, the clarity and comprehensibility of a DFG
facilitates the reproduction of scientific processes.
The type information in a DFG is used to assure that the appropriate kinds of processes
are being applied to the appropriate kinds of data. For example, the process type could specify
the exact kind of linear regression algorithm or the particular statistical package to be applied.
Similarly, the dataset type could indicate the type of data, as noted above, but could also go
further and indicate the precise format (syntax) of each field or the access methods (semantics)
that must be associated with the data, as found in object-oriented type definitions. The type
information is used for checking type conformance; it should be possible to incorporate
automatically the structural transformations used to transform datasets of one type to another
(Bowers and Ludäscher 2004).
In contrast to the type relationships represented in a DFG, a dataset derivation graph
(DDG) documents the specific instances of datasets produced by the actions of specific
processes operating on other specific datasets. Dataset instances documented by DDGs are
uniquely specified and are the usual focus of attention in reproducing and repeating scientific
processes. Examples include “rainfall data collected at the Harvard Forest on 1 June 2004 at
hourly intervals”, or “SAS version 6.1 of the non-linear regression using nls2 (Huet et al.
2004)”. Whereas a DFG is the recipe found in the 10th edition of the Joy of Cooking, the DDG is
the result: a spiced chocolate prune cake baked for Jane’s 60th birthday. Reproducibility and
repeatability demand the documentation provided by the DDG, namely the specific datasets and
tools that were actually used.
For simple scientific processes, a DFG and DDG together may be sufficient. For more
complex processing, especially processes where exceptional conditions might arise that require
special attention, a richer notation is more effective. For such cases, a process definition graph
(PDG) is used to represent complex procedural detail. The PDG emphasizes the specification of
the full range of possible activities that might be taken, such as those that happen in response to
unexpected or undesired occurrences. Continuing the cookbook analogy, a PDG could include a
specification of what to do in case the egg yolks curdle while beating up an egg-based sauce, or
steps that might be taken in case a particular specified type of ingredient is either unavailable, or
is of insufficient quality. The PDGs associated with a DFG form the basis for reasoning about
the possibility of applying incompatible statistical methods, or using inconsistent or incompatible
datasets with one another.
We describe the representations of these three types of graphs and illustrate their use in
processing the eddy covariance data to yield an estimate of nighttime carbon flux.
Dataflow graphs
Dataflow graphs (DFGs) specify the precise ways in which the various tools and
subprocesses are used to derive new datasets. In a DFG (Fig. 2a), the nodes represent the various
tools, activities or datasets and the edges represent the flow of datasets into and out of these
processes. A DFG supports the definition of additional process details such as parallelism and
iteration. The DFG defined here uses different icons to differentiate processes (ovals in Fig. 2a)
from datasets (boxes in Fig. 2a).
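The structural rule of a DFG, that edges alternate between dataset types and processes, can be sketched in a few lines of Python. This is an illustration only, not SciWalker code; the class and node names are our assumptions.

```python
from dataclasses import dataclass, field

# A minimal sketch of a dataflow graph (DFG): nodes are either dataset
# types (boxes in Fig. 2a) or processes (ovals); edges carry datasets
# into and out of processes.

@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "dataset" or "process"

@dataclass
class DFG:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, target) pairs

    def add_edge(self, src: Node, dst: Node) -> None:
        # An edge must link a dataset type to a process, or vice versa.
        assert src.kind != dst.kind, "edges connect datasets and processes"
        self.nodes.update({src, dst})
        self.edges.add((src, dst))

# Hypothetical fragment of the eddy covariance DFG.
tower = Node("Tower Data", "dataset")
aggregate = Node("Create Aggregated Data", "process")
aggregated = Node("Aggregated Data", "dataset")

dfg = DFG()
dfg.add_edge(tower, aggregate)
dfg.add_edge(aggregate, aggregated)
```

The assertion in `add_edge` enforces the bipartite discipline that distinguishes a DFG from an arbitrary directed graph.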
Data derivation graphs
A data derivation graph (DDG; Fig. 2b) keeps track of specific datasets that have been
derived by the actions of tools operating on (other) specific datasets. In a DDG, a node (clipped
box) represents each actual dataset instance created by executing a process. Each node is
connected by an edge (arrow) to the dataset(s) from which it was derived. The edge is annotated
(clipped oval) with a specification of the specific tool instance (e.g., the exact version of a
statistical routine) or sub-process instance (e.g., a reference to another DFG or PDG) used for the
derivation.
A DDG never has cycles, but grows as more process steps are executed and datasets are
created. The most recent instance of a dataset is appended to the DDG by an edge (or edges),
representing how it was derived from previously derived datasets. No datasets are ever destroyed
or supplanted in the DDG. They are simply created and left available for subsequent use.
A DFG describes how types of datasets can be created, whereas the DDG describes how
specific instances were actually derived. A DFG integrated with a DDG allows for the
identification of the sequence of processes and the datasets created by their application. For
example, although a DDG has no cycles, a DFG uses cycles to represent iteration – the
application of a specific sequence of processing steps to successive datasets. Thus, the DFG can
be viewed as a specification of a generator that builds ever-growing DDGs, creating an
interpretable audit trail of process executions. To ensure reproducibility, it is important to
preserve rather than overwrite datasets produced by such iterative processes.
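The append-only character of a DDG can be sketched as a list of derivation records, each naming its output, its inputs, and the tool used, from which a lineage (audit trail) can be walked backwards. This is a hypothetical Python sketch, not the SciWalker implementation; all names are illustrative.

```python
import datetime

# An append-only data derivation graph (DDG). Records are only ever
# appended, never overwritten, so every dataset instance remains
# traceable to the exact inputs and tool version that produced it.

class DDG:
    def __init__(self):
        self._records = []  # append-only derivation records

    def record(self, output, inputs, tool):
        self._records.append({
            "output": output,
            "inputs": tuple(inputs),
            "tool": tool,
            "when": datetime.datetime.now().isoformat(),
        })

    def lineage(self, output):
        # Walk backwards from a dataset instance to all of its ancestors.
        ancestors, frontier = [], [output]
        while frontier:
            name = frontier.pop()
            for r in self._records:
                if r["output"] == name:
                    ancestors.extend(r["inputs"])
                    frontier.extend(r["inputs"])
        return ancestors

ddg = DDG()
ddg.record("Selected Data 1",
           ["Aggregated Data 1", "Selection Criteria 1"],
           "Segregate Data v1.0")
ddg.record("Interpolation Model 1", ["Selected Data 1"],
           "Create Interpolation Model (R 1.9.0)")
```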
As noted, there is a relationship between the instances represented in a DDG for a process
execution and the DFGs that were executed to create those instances. We have found that it is
useful to provide a view of the process that combines these two representations by imposing a
representation of the created instances onto the DFGs for that process. Since there may be many
instances of a dataset associated with a node that represents the creation of a particular dataset
type, we provide an alternative view of a DFG overlain on a DDG that “stacks” icons
representing instances behind the icon representing the particular dataset type (see Fig. 6, below).
Process definition graphs
A process definition graph (PDG; Fig. 2c) provides a more detailed view of the process
than does a DFG. Although a DFG could be used to define any process, experience has shown
that this representation is not well suited for representing complex processes that require, for
example, conditional decisions and human intervention. Such a complex process could be
abstractly represented as a single activity node (oval) in a DFG, but then could be elaborated as a
PDG. The PDG describes procedural details such as the order in which steps must be taken,
where parallelism is possible, pre-condition and post-condition checks to determine whether or
not processing has been successful, actions to try when various exceptional conditions occur, and
conditions under which processing is to be iterated or terminated. In short, the PDG specifies
detailed procedural flow.
A PDG specification (Fig. 2c), drawn in the style of the process definition language
Little-JIL (Wise 1998), represents processes as decomposition hierarchies of steps. Each step is
represented by a cluster of icons either within or around a rectangle, called the Step Bar. The
triangles outside the Step Bar are iconic representations of optional pre- (left of the Step Bar) and
post-condition (right of the Step Bar) steps used to control and monitor the execution of the step.
When these triangles are not filled in, no pre- or post-conditions are specified, as in
this figure. The leftmost icon inside the Step Bar indicates the sequencing order of the sub-steps.
In Fig. 2c, the leftmost icon is an arrow, which indicates that this step is processed in sequential
order by executing its (hierarchical) sub-steps in order from left to right. There are also icons to
represent parallel execution and non-deterministic selection, among others. The sub-steps
associated with the rightmost icon (X) inside the Step Bar specify the behavior to be followed
if exceptional conditions arise.
In Figure 2c, the process step labeled Summarize Data is defined as the composition of
three substeps, Process Item, Compute Summary, and Fix Problem. Process Item and Compute
Summary are associated with normal process flow, which is specified by the sequence icon (e.g.,
an arrow, representing sequential left to right flow). Process Item has variable cardinality, as
represented by the asterisk on its incoming edge, indicating that zero or more instances of this
step will need to be executed, depending on the number of items available to be processed. Since
this is a sequential step, each instance of Process Item will be completed before the next one is
initiated and Compute Summary will only start after the last (and thus all) instances of Process
Item have completed execution. The sub-step Fix Problem is associated with exceptional process
flow, as indicated by the X icon in the Step Bar, and will only be invoked if the exception Fix
Problem is raised by any of the sub-steps associated with process step Summarize Data.
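The control flow of Summarize Data can be mimicked in ordinary Python: items are processed strictly in sequence, the summary runs only after the last item, and an exception raised by a sub-step is caught by the Fix Problem handler. This sketch only imitates the Little-JIL semantics described above; the step and function names are illustrative.

```python
# Sketch of the Summarize Data step of Fig. 2c. Zero or more Process
# Item instances run sequentially (the asterisk cardinality), Compute
# Summary runs after all of them, and a FixProblem exception raised by
# any sub-step is handled by the Fix Problem repair step.

class FixProblem(Exception):
    pass

def process_item(item):
    if item is None:            # stand-in for a problematic data item
        raise FixProblem(item)
    return item * 2

def fix_problem(item):
    return 0                    # repair action: substitute a default value

def summarize_data(items):
    processed = []
    for item in items:          # sequential: each instance completes
        try:                    # before the next one begins
            processed.append(process_item(item))
        except FixProblem:
            processed.append(fix_problem(item))
    return sum(processed)       # Compute Summary runs only at the end

result = summarize_data([1, 2, None, 4])
```

Note that the handler is attached to the parent step, as in the PDG: any sub-step's exception routes to the same repair logic.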
Although not shown in Figure 2c, the edges connecting a higher-level step to a lower-
level step may have associated annotations that define the datasets that flow back and forth
between steps, and a step may include a specification of the resources needed for its execution. A
principal resource to be specified for each step is the agent responsible for the step’s execution.
This agent may be hardware, software, and/or human. In all cases, it is necessary to detail the
nature of the agent and other resources needed.
Applying an analytic web to the eddy flux data
We illustrate the application of an analytic web through the example of examining the
effect of varying u*threshold on the estimate of nighttime carbon flux derived from eddy covariance
data. The data we used were collected in 2000 and 2001 in a hemlock forest in central
Massachusetts (Fig. 1; Hadley and Schedlbauer 2002). Because the tower was sited near the
northeast corner of a hemlock stand, only data collected when the wind blew from the southwest
(180–270° E of N) were usable for estimating carbon flux from the hemlock stand. As described
above (A prose description of the data-acquisition process), we therefore first discarded data
collected when the wind was blowing from any direction other than the southwest (i.e., 0–179° E
of N and 271–359° E of N). The application of the u*threshold criterion also resulted in discarding
some of the data even when the wind was from the southwest. We used a linear regression model
using data from independently measured environmental variables (e.g., soil temperature, soil
moisture) to fill the resulting gaps in the dataset (Hadley and Schedlbauer 2002). Estimated
carbon flux increased non-linearly with the value of u*threshold, and it appeared that C flux
approached an asymptote, whereas the fraction of data retained decreased linearly with u*threshold
(Fig. 1). This result is quite similar to that of Hollinger et al. (2004) who analyzed four years of
nighttime eddy covariance data in a coniferous forest in Maine, USA.
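The gap-filling step described above can be sketched as follows: fit an ordinary least-squares line of flux on soil temperature using the retained records, then predict flux where records were discarded. The numbers and variable names here are invented for illustration; the actual model of Hadley and Schedlbauer (2002) used several independently measured environmental variables and real data.

```python
# Sketch of regression-based gap filling. A simple linear model is fit
# to the observations that survived the wind-direction and u* filters,
# then used to estimate flux for the gaps those filters created.

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# (soil temperature, measured flux); None marks a filtered-out gap.
records = [(10.0, 2.0), (12.0, 2.4), (14.0, None), (16.0, 3.2)]

kept = [(t, f) for t, f in records if f is not None]
a, b = fit_line([t for t, _ in kept], [f for _, f in kept])

# Predict flux for the gaps; keep measured values where they exist.
filled = [f if f is not None else a + b * t for t, f in records]
```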
The dataflow graph (DFG) for creating usable datasets from the eddy covariance data
(Fig. 3) includes dataset types Tower Data, Environmental Data, Selection Criteria, Aggregated
Data, Excluded Data, Rejected Data, Selected Data, Interpolated Data, and Row-Filled Data. As
models and criteria are also data types, the Interpolation Model and the Selection Criteria
are likewise represented as boxes. Processes, represented by ovals, include Create Aggregated
Data, Segregate Data, Create Interpolation Model, Apply Interpolation Model, and Merge Data;
these represent the types of actions to be applied to particular types of data. Note that a
process may require more than one input; for example, Interpolation Model and Excluded Data
are both required for the process Apply Interpolation Model. A dataset may also be input to more
than one process. The diamond indicates that the same instance of data type Selected Data is
input to both the Create Interpolation Model and Merge Data processes.
Figure 4 illustrates a modest enhancement of the DFG shown in Fig. 3 in which the
criteria used to partition the data may be modified after examining the results of interpolating
and merging the data (the Revise Selection Criteria process uses the formerly final data type
Row-Filled Data). This iterated DFG (Fig. 4) specifies that new selection criteria will be created
based on examination of the Row-Filled Data and that these criteria may be applied to existing
datasets, generating new Row-Filled Data. The process indicates that this iteration might be
repeated, causing the successive generation of new criteria and new output data. Thus, this
process might produce a sequence of criteria and a sequence of datasets. It is vital that there be a
firm connection between each dataset and the model from which it was created.
The data derivation graph (DDG), shown in Figure 5, resulting from the execution of the
DFG shown in Figure 4, illustrates the connections between each instance of the dataset and the
process that created it. This DDG depicts the state of the execution of the DFG at the beginning
of the second iteration. As in Fig. 2, boxes with clipped corners in Fig. 5 denote specific dataset
instances. Each clipped box represents the dataset derived by applying the activity from which its
incoming edge emanates to the dataset(s) input to that activity.
Rather than depicting DDGs independently, the stacked view (Fig. 6) shows the
relationship between a dataset type in the DFG and each dataset instance created during
execution. This representation more clearly shows the specific dataset instances that have been
created in successive (iterative) applications of an analytic web.
Although the DDG and the stacked view provide a clearer view of how certain datasets
have been created, metadata are still needed to clarify the origins and particulars of the datasets
(particularly the Tower Data and the Environmental Data) and the Interpolation Model. Such
metadata are essential for the repeatability and reproducibility of the final results. EML readily
accommodates this kind of metadata.
Just as dataset metadata provide additional information about a dataset instance to
ensure reproducibility and repeatability, more detailed process information is also sometimes
needed to carefully delineate the actions and decisions associated with a process. We currently
support two ways of providing this detail. One is by decomposing any DFG activity node into a
more detailed DFG. This hierarchical decomposition is a good approach for modeling the
relationship between a more abstract concept and its more detailed representation. Such
decomposition can be repeated to any level of detail. Alternatively, the details associated with a
DFG activity node can be represented by a process definition graph (PDG).
A PDG of the DFG shown in Figure 4 illustrates this point (Fig. 7). Without delving
deeply into the syntax and semantics of Little-JIL (see Wise 1998 for a complete description),
there are two details of particular import in this diagram. First, the filled triangle on the left side
of the Step Bar for Create Interpolation Model indicates a guard or pre-condition on this step.
This pre-condition will assure that the dataset is of sufficient size to create the model. Second,
the exception handler (X) associated with the step Create Row-Filled Data specifies that if this
pre-condition fails, then the process step associated with this handler will intercede to change the
selection criteria. Without support for exception handling, it is more difficult to encode repair
activities.
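The guard-and-handler loop of Fig. 7 can be imitated in plain Python: a pre-condition on Create Interpolation Model checks that the selected dataset is large enough, and when the guard fails, the handler revises the selection criteria and the step is retried. The minimum size, the revision rule, and all names below are illustrative assumptions, not values from the paper.

```python
# Sketch of the PDG's pre-condition and exception handling. The guard
# on Create Interpolation Model rejects datasets that are too small;
# the handler on Create Row-Filled Data relaxes the selection criteria
# and iterates. (The sketch assumes relaxation eventually succeeds.)

MIN_POINTS = 3                          # hypothetical guard threshold

class PreconditionFailed(Exception):
    pass

def create_interpolation_model(selected):
    if len(selected) < MIN_POINTS:      # pre-condition (guard)
        raise PreconditionFailed(len(selected))
    return "model over %d points" % len(selected)

def revise_selection_criteria(threshold):
    return threshold - 0.05             # relax u* cutoff to retain more data

def create_row_filled_data(data, threshold):
    while True:
        selected = [u for u in data if u >= threshold]
        try:
            return create_interpolation_model(selected), threshold
        except PreconditionFailed:      # handler: change criteria, iterate
            threshold = revise_selection_criteria(threshold)

ustar = [0.12, 0.18, 0.22, 0.31, 0.40]
model, final_threshold = create_row_filled_data(ustar, 0.35)
```

Without the explicit handler, this repair logic would have to be entangled with the normal flow, which is exactly the difficulty the PDG notation is designed to avoid.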
EXPERIMENTAL EVALUATION THROUGH THE SCIWALKER TOOL
We have developed a tool, called SciWalker, which uses the process and data
derivation representations (DFGs, DDGs, and PDGs) to support the definition and execution of
analytic webs. The instances in a DFG or PDG supported by SciWalker are all accompanied
by specific metadata annotations, viewed through clickable menu items that provide specific
information about how they were generated. The process metadata generated by SciWalker
are easily adapted for inclusion in XML applications such as EML (Table 2).
By using SciWalker to automate the process of estimating nighttime carbon exchange
from eddy covariance data, we have been able to quickly determine the effect of varying
u*threshold on estimated nighttime C flux from a forest. Simultaneously, we used SciWalker to
create the DDG of this process, which provides a comprehensive audit trail that is easily
accessible via the Internet. With this audit trail, data that were included or excluded can be easily
retrieved and examined. Others have examined effects of u*threshold on estimates of carbon flux
(Barford et al. 2001, Saleska et al. 2003, Hollinger et al. 2004), but neither the sets of accepted
and excluded data, nor the gap-filling models used in these studies, are readily accessible by
other researchers. Using an analytic web is a step forward both in ease and speed of data
processing and analysis for us as individual researchers, and also a great leap forward in
communicating our data analysis procedures to others.
While influences of u*threshold on ecosystem carbon flux estimated from eddy covariance
data have been examined in several papers, effects of other meteorological variables have been
examined less frequently. Wind direction is of particular interest, because forest composition is
rarely uniform around an eddy covariance tower. In general, one cannot relate carbon flux to a
specific type of forest without limiting the range of wind directions that provide acceptable data
for processing. As carbon exchange estimates and statistical models of carbon exchange for one
forest cannot be applied to other forests unless the forest composition is similar in the two areas,
it is often important for researchers to partition eddy covariance data by wind direction in order
to confine measurements to a specific forest type. The Analytic Web provides an approach to do
this, by allowing for the specification of the range of wind direction for included versus excluded
data. In the case of our estimates of C exchange by a hemlock forest, we included data only if the
winds were from the southwest (180–270° compass bearing) because of the relatively small size
of the hemlock forest we were studying, and the position of the eddy covariance tower in the
northeast corner of hemlock-dominated forest. Using SciWalker, we are now examining the
effects of using other ranges of wind direction, thereby including other forest types within the
eddy covariance footprint.
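Partitioning eddy covariance data by wind direction reduces to a parameterized sector test, as sketched below. The sector bounds of 180 to 270 degrees match the hemlock example; the field names and the wrap-around handling are our assumptions.

```python
# Sketch of a wind-direction selection criterion. Making the sector a
# parameter lets the same analytic web be re-run for other wind
# sectors, and hence other forest types within the tower footprint.

def in_sector(direction, lo=180.0, hi=270.0):
    # Handles sectors that wrap through north, e.g. (315, 45).
    direction %= 360.0
    if lo <= hi:
        return lo <= direction <= hi
    return direction >= lo or direction <= hi

observations = [{"wdir": 210.0}, {"wdir": 90.0}, {"wdir": 265.0}]
selected = [o for o in observations if in_sector(o["wdir"])]
rejected = [o for o in observations if not in_sector(o["wdir"])]
```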
DISCUSSION AND FUTURE DIRECTIONS
There is an important need to develop tools and techniques that will support repeatability,
reproducibility, and reliability of the scientific processes used to analyze ecological data. Our
proposal for the use of software engineering technology to create analytic webs as syntheses of
three types of graphs seems quite promising, based upon our initial work with the SciWalker
prototype and its application to eddy flux data used to estimate whether forests are sources or
sinks of CO2. In its current implementation, SciWalker has not yet integrated the PDGs with
the DFGs and DDGs. However, even without the full capabilities provided by PDGs, our
application of SciWalker already has led to new scientific insights and interesting new results.
Future versions of SciWalker will incorporate the PDGs and support assessment of
process reliability. It is our goal that scientific analyses eventually be accompanied by
certifications derived from the (presumably successful) application of formal process analyzers
to their process metadata. These certifications would then be usable by other scientists to guide them
away from dangerous misuse of datasets or combinations of processes. The net result will be
science that is not only more rapid and efficient, but also more safe and sound.
ACKNOWLEDGMENTS
This work was supported by NSF grant CCR-0205575 and is a contribution from the Harvard
Forest Long Term Ecological Research (LTER) program. We thank Elizabeth Farnsworth, Nick
Gotelli, Matt Jones, and Kristin Vanderbilt, for critical comments on drafts of the manuscript.
LITERATURE CITED
Ailamaki, A., Y. E. Ioannidis, and M. Livny. 1998. Scientific workflow management by database
management. Proceedings of the 10th international conference on scientific and statistical
database management. Capri, Italy.
Aloisio, G., and M. Cafaro. 2003. A dynamic earth observation system. Parallel Computing
29:1357-1362.
Altintas, I., C. Berkeley, E. Jaeger, M. Jones, B. Ludäscher, and S. Mock. 2004a. Kepler: an
extensible system for design and execution of scientific workflows. Proceedings of the
16th international conference on scientific and statistical database management. Santorini
Island, Greece.
Altintas, I., C. Berkeley, E. Jaeger, M. Jones, B. Ludäscher, and S. Mock. 2004b. Kepler:
towards a grid-enabled system for scientific workflows. Proceedings of the tenth global
grid forum - the workflow in grid systems workshop. Berlin, Germany.
Andelman, S. J., C. M. Bowles, M. R. Willig, and R. B. Waide. 2004. Understanding
environmental complexity through a distributed knowledge network. BioScience 54:240-
246.
Bailey R. E. 2002. Global warming and other eco-myths: how the environmental movement uses
false science to scare us to death. Prima Publishing, Roseville, California, USA.
Baldocchi, D. D., B. B. Hicks, and T. P. Meyers. 1988. Measuring biosphere-atmosphere
exchanges of biologically related gases with micrometeorological methods. Ecology
69:1331-1340.
Barford, C. C., S. C. Wofsy, M. L. Goulden, J. W. Munger, E. H. Pyle, S. P. Urbanski, L.
Hutyra, S. R. Saleska, D. Fitzjarrald, and K. Moore. 2001. Factors controlling long- and
short-term sequestration of atmospheric CO2 in a mid-latitude forest. Science 294:1688-
1691.
Belovsky, G. E., D. B. Botkin, T. A. Crowl, K. W. Cummins, J. F. Franklin, M. L. Hunter, Jr., A.
Joern, D. B. Lindenmayer, J. A. MacMahon, C. R. Margules, and J. M. Scott. 2004. Ten
suggestions to strengthen the science of ecology. BioScience 54:345-351.
Bettelheim, E. C., and G. D'Origny. 2002. Carbon sinks and emissions trading under the Kyoto
Protocol: a legal analysis. Philosophical Transactions of the Royal Society of London
A360:1827-1851.
Billard, L., and E. Diday. 2003. From the statistics of data to the statistics of knowledge:
symbolic data analysis. Journal of the American Statistical Association 98:470-487.
Bowers, S. and B. Ludäscher. 2004. An ontology-driven framework for data transformation in
scientific workflows, Pages 1-16 in E. Rahm, editor. Proceedings of the 1st international
workshop in data integration in the life sciences (DILS). Springer-Verlag, Berlin,
Germany.
Bradbury, J. P., and W. E. Dean. 1993. Elk Lake, Minnesota: evidence for rapid climate change
in the north central United States. GSA Special Paper 276:1-336.
Brown, S. 2002. Measuring, monitoring, and verification of carbon benefits for forest-based
projects. Philosophical Transactions of the Royal Society of London Series A360:1669-
1683.
Buja, A., D. Cook, and D. F. Swayne. 1996. Interactive high-dimensional data visualization.
Journal of Computational and Graphical Statistics 5:78-99.
Carswell, F. E., A. L. Costa, M. Palheta, Y. Malhi, P. Meir, J. D. R. Costa, M. D. Ruivo, L. D.
M. Leal, J. M. N. Costa, R. J. Clement, and J. Grace. 2002. Seasonality in CO2 and H2O
flux at an eastern Amazonian rain forest. Journal of Geophysical Research-Atmospheres
107:8076.
Cramer, W., A. Bondeau, S. Schaphoff, W. Lucht, B. Smith, and S. Sitch. 2004. Tropical forests
and the global carbon cycle: impacts of atmospheric carbon dioxide, climate change and
rate of deforestation. Philosophical Transactions of the Royal Society of London
B359:331-343.
Esser, G., H. F. H. Lieth, J. M. O. Scurlock, and R. J. Olson. 2000. Osnabrück net primary
productivity data set. Ecology 81:1177.
Foster I., and C. Kesselman. 1999. The Grid: blueprint for a new computing infrastructure.
Morgan-Kaufmann Publishers, San Francisco, CA.
Garcia-Berthou, E., and C. Alcaraz. 2004. Incongruence between test statistics and P values in
medical papers. BioMed Central Medical Research Methodology 4:13.
Ghezzi C., M. Jazayeri, and D. Mandrioli. 2003. Fundamentals of software engineering, 2nd
edition. Pearson Education, Inc., Upper Saddle River, New Jersey.
Gotelli N. J., and A. M. Ellison. 2004. A primer of ecological statistics. Sinauer Associates,
Sunderland, Massachusetts.
Goulden, M. L., J. W. Munger, S.-M. Fan, B. C. Daube, and S. C. Wofsy. 1996. Exchange of
carbon dioxide by a deciduous forest: response to interannual climate variability. Science
271:1576-1578.
Grace, J., J. Lloyd, J. McIntyre, A. C. Miranda, P. Meir, H. S. Miranda, C. Nobre, J. Moncrieff,
J. Massneder, Y. Malhi, I. Wright, and J. Gash. 1995. Carbon dioxide uptake by an
undisturbed tropical rainforest in western Amazonia, 1992 to 1993. Science 270:778-780.
Grace, J., Y. Malhi, J. Lloyd, J. McIntyre, A. C. Miranda, P. Meir, and H. S. Miranda. 1996. The
use of eddy covariance to infer the net carbon dioxide uptake of the Brazilian rainforest.
Global Change Biology 2:209-218.
Hadley, J. L., and J. L. Schedlbauer. 2002. Carbon exchange of an old-growth eastern hemlock
(Tsuga canadensis) forest in central New England. Tree Physiology 22:1079-1092.
Hall, B., G. Motzkin, and D. R. Foster. 2002. Three hundred years of forest and land-use change
in Massachusetts. Journal of Biogeography 29:1319-1336.
Helly, J. J., T. T. Elvins, D. Sutton, D. Martinez, S. E. Miller, S. Pickett, and A. M. Ellison.
2002. Controlled publication of digital scientific data. Communications of the
Association for Computing Machinery 45:97-101.
Hollinger, D. Y., J. D. Aber, B. Dail, E. A. Davidson, S. M. Goltz, H. Hughes, M. Y. Leclerc, J.
T. Lee, A. D. Richardson, C. Rodrigues, N. A. Scott, D. Achuatavarier, and J. Walsh.
2004. Spatial and temporal variability in forest-atmosphere CO2 exchange. Global
Change Biology in press.
Hollinger, D. Y., F. M. Kelliher, J. N. Byers, J. E. Hunt, T. M. McSeveny, and P. L. Weir. 1994.
Carbon dioxide exchange between an undisturbed old-growth temperate forest and the
atmosphere. Ecology 56:134-150.
Huet S., A. Bouvier, M.-A. Poursat, and E. Jolivet. 2004. Statistical tools for nonlinear
regression: a practical guide with S-Plus and R examples, 2nd edition. Springer-Verlag,
Inc., New York, New York.
Intergovernmental Panel on Climate Change (IPCC). 2001. Climate change 2001: synthesis
report. IPCC Secretariat, Geneva, Switzerland.
Jones, M. B., C. Berkley, J. Bojilova, and M. Schildhauer. 2001. Managing scientific metadata.
IEEE Internet Computing September/October 2001:59-68.
Kareiva, P. M. 2002. Applying ecological science to recovery planning. Ecological Applications
12:629.
Krishnan, A. 2004. A survey of life sciences applications on the Grid. New Generation
Computing 22:111-125.
Li, W. W., R. W. Byrnes, J. Hayes, A. Birnbaum, V. M. Reyes, A. Shahab, C. Mosley, D.
Pekurovsky, G. B. Quinn, I. N. Shindyalov, H. Casanova, L. Ang, F. Berman, P. W.
Arzberger, M. A. Miller, and P. E. Bourne. 2004. The encyclopedia of life project: Grid
software and deployment. New Generation Computing 22:127-136.
Lomborg B. 2001. The skeptical environmentalist: measuring the real state of the world.
Cambridge University Press, Cambridge, UK.
Lubchenco, J., A. M. Olson, L. B. Brubaker, S. R. Carpenter, M. M. Holland, S. P. Hubbell, S.
A. Levin, J. A. MacMahon, P. A. Matson, J. M. Melillo, H. A. Mooney, C. H. Peterson,
H. R. Pulliam, L. A. Real, P. J. Regal, and P. G. Risser. 1991. The sustainable biosphere
initiative: an ecological research agenda. Ecology 72:371-412.
Malhi, Y., A. D. Nobre, J. Grace, B. Krujit, M. G. P. Pereira, A. Culf, and S. Scott. 1998. Carbon
dioxide transfer over a central Amazonian rain forest. Journal of Geophysical Research
D24:31593-31612.
Meehl, G. A., W. M. Washington, J. M. Arblaster, and A. X. Hu. 2004. Factors affecting climate
sensitivity in global coupled models. Journal of Climate 17:1584-1596.
Michener, W. K. 2000. Ecological knowledge and future data challenges. Pages 162-174 in W.
K. Michener, and J. W. Brunt, editors. Ecological data: design, management and
processing. Blackwell Science, Oxford, UK.
Michener, W. K., T. J. Baerwald, P. Firth, M. A. Palmer, J. L. Rosenberger, E. A. Sandlin, and
H. Zimmerman. 2001. Defining and unraveling biocomplexity. BioScience 51:1018-
1023.
Michener, W. K., J. W. Brunt, J. J. Helly, T. B. Kirchner, and S. G. Stafford. 1997.
Nongeospatial metadata for the ecological sciences. Ecological Applications 7:330-342.
Regan, H. M., M. Colyvan, and M. A. Burgman. 2002. A taxonomy and treatment of uncertainty
for ecology and conservation biology. Ecological Applications 12:618-628.
Reinefeld, A., and F. Schintke. 2002. Concepts and technologies for a worldwide grid
infrastructure. Proceedings of Euro-Par 2002 Parallel Processing 2400:62-71.
Saleska, S. R., S. D. Miller, D. M. Matross, M. L. Goulden, S. C. Wofsy, H. R. de Rocha, P. B.
de Camargo, P. Crill, B. C. Daube, H. C. de Freitas, L. Hutyra, M. Keller, V. Kirchnoff,
M. Menton, J. W. Munger, E. Hammond-Pyle, A. H. Rice, and H. Silva. 2003. Carbon in
Amazon forests: unexpected seasonal fluxes and disturbance-induced losses. Science
302:1554-1557.
Saunders, L. S., R. Hanbury-Tenison, and I. R. Swingland. 2002. Social capital from carbon
property: creating equity for indigenous people. Philosophical Transactions of the Royal
Society of London A360:1763-1775.
Savage, K. E., and E. A. Davidson. 2001. Interannual variation of soil respiration in two New
England forests. Global Biogeochemical Cycles 15:337-350.
Schemske, D. W., B. C. Husband, M. H. Ruckelshaus, C. Goodwillie, I. M. Parker, and J. G.
Bishop. 1994. Evaluating approaches to the conservation of rare and endangered plants.
Ecology 75:584-606.
Smith, A. J., J. J. Donovan, E. Ito, and D. R. Engstrom. 1997. Ground-water processes
controlling a prairie lake's response to middle Holocene drought. Geology 25:391-394.
Wegman, E. J. 1995. Huge data sets and the frontiers of computational feasibility. Journal of
Computational and Graphical Statistics 4:281-285.
Wigley, T. M. L., and S. C. B. Raper. 1992. Implications for climate and sea level of revised
IPCC emissions scenarios. Nature 357:293-300.
Wise A. 1998. Little-JIL 1.0 language report. Computer Science Technical Report, University of
Massachusetts, Amherst, Massachusetts.
Wofsy, S. C., M. L. Goulden, J. W. Munger, S. M. Fan, P. S. Bakwin, B. C. Daube, S. L.
Bassow, and F. A. Bazzaz. 1993. Net CO2 exchange in a mid-latitude forest. Science
260:1314-1317.
Table 1. A taxonomy of data set sizes (after Wegman 1995).
Descriptor Data set size (bytes) Storage
Tiny 10² A piece of paper
Small 10⁴ A notebook
Medium 10⁶ Removable media (e.g., floppy disk)
Large 10⁸ A single hard disk
Huge 10¹⁰ Multiple hard disks or RAID arrays
Ridiculous 10¹² Robotic tape storage silos
Table 2. Simplified excerpt from an XML file describing the Analytic Web shown in Fig. 3. The
data types for Selected Data and Interpolation Model (boxes in Fig. 3) are declared through
references to associated EML files. The activity Create Interpolation Model (oval) is declared by
including the actual script in the R statistical language. The connections (edges) in the Dataflow
Graph are specified by the <connector> and <binding> elements.
<analyticWeb>
  <type-declaration name="cflux-data">
    <attributes>year,doy,u,ustar,wdir,par,tair,dmintair,tsoil,vpd,mco2flux,eco2flux,source</attributes>
    <body>
      <metadata object="cflux-data.xml" language="EML" version="2.0.0"/>
    </body>
  </type-declaration>
  <type-declaration name="nighttime-model">
    <attributes>mco2flux,tsoil</attributes>
    <body>
      <metadata object="nighttime-model.xml" language="EML" version="2.0.0"/>
    </body>
  </type-declaration>
  <activity-declaration name="create-model">
    <input name="data" type="cflux-data"/>
    <output name="model" type="nighttime-model"/>
    <body>
      <script language="R" version="1.9.0">model=lm(log(mco2flux)~tsoil, data)</script>
    </body>
  </activity-declaration>
  <dataflow-declaration name="nighttime-cflux">
    <data name="selected-data" type="cflux-data"/>
    <data name="interpolation-model" type="nighttime-model"/>
    <activity name="create-interpolation-model" type="create-model"/>
    <connector kind="fork">
      <binding from="selected-data" to="create-interpolation-model"/>
      <binding from="selected-data" to="merge-data"/>
    </connector>
    <connector kind="simple">
      <binding from="create-interpolation-model" to="interpolation-model"/>
    </connector>
  </dataflow-declaration>
</analyticWeb>
Figure 1. Estimated nighttime CO2 flux (solid circles and solid line), and fraction of data used in
the analysis (open circles and dashed line) from April – June 2001 above a central Massachusetts
hemlock forest, for southwesterly winds and for different ranges of friction velocity (u*) (Hadley
and Schedlbauer 2002).
Figure 2. Examples of a data flow graph (DFG), data derivation graph (DDG) and process
definition graph (PDG). In a data flow graph (A) datasets are indicated by rectangles and
processes are indicated by ovals. Edges (arrows) indicate the direction of flow. In this DFG,
datasets of Type 1 and Type 2 are used by Tool A to create a dataset of Type 3. A data derivation
graph (B) illustrates the particular outcome or instance of executing the DFG. Instances of
datasets are indicated by rectangles with one corner clipped off, and instances of applying a
specific tool are indicated by ovals with all four corners clipped off. Edges (arrows) indicate how
a given dataset was derived from a previous dataset. These edges are annotated (dotted lines) to
indicate which process was applied. In this DDG, a particular dataset of Type 3 was derived from
two particular datasets of Type 1 and Type 2 respectively, by applying a specific kind of Tool A.
A process definition graph (C) is a symbolic representation of the procedural details required by
the DFG to produce the DDG. Each step in the procedure is indicated by a set of icons
surrounding a rectangular Step Bar. Icons outside the Step Bar specify pre-condition (to the left
of the Step Bar) or post-condition (to the right of the Step Bar) steps that control and monitor
execution of the step. Icons within the Step Bar specify the ordering of the process (sequential,
parallel, or non-deterministic) and the behavior to follow if exceptions occur. The circles are
handles to which specifications of the resources needed to support execution of the step can be
attached. Asterisks indicate that zero or more instances of the step may require execution,
depending on the number of items available for processing.
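The DFG/DDG distinction in this figure can be made concrete with a short sketch. The following Python fragment is illustrative only: the class names, dataset file names, and run labels are assumptions for this example, not part of the authors' tools.

```python
# Minimal sketch of the DFG/DDG distinction in Figure 2.
# A DFG records dataset *types* and the tools that connect them;
# a DDG records the concrete *instances* produced by one execution.

class DataFlowGraph:
    """Template: dataset types and the tools that transform them."""
    def __init__(self):
        self.edges = []  # (input_types, tool_name, output_type)

    def add_process(self, input_types, tool_name, output_type):
        self.edges.append((tuple(input_types), tool_name, output_type))

class DataDerivationGraph:
    """Record of one execution: which instances produced which."""
    def __init__(self):
        self.derivations = []  # (input_ids, tool_run, output_id)

    def record(self, input_ids, tool_run, output_id):
        self.derivations.append((tuple(input_ids), tool_run, output_id))

# The DFG of Fig. 2A: Type 1 + Type 2 --Tool A--> Type 3
dfg = DataFlowGraph()
dfg.add_process(["Type 1", "Type 2"], "Tool A", "Type 3")

# One execution of that DFG yields the DDG of Fig. 2B
ddg = DataDerivationGraph()
ddg.record(["dataset_1.csv", "dataset_2.csv"], "Tool A (run 1)", "dataset_3.csv")

print(len(ddg.derivations))  # → 1, a single recorded derivation
```

Executing the same DFG again with different inputs would append a second derivation record, which is how a DDG accumulates an audit trail across runs.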
Figure 3. A data flow graph (DFG) illustrating the process used for the analysis of the eddy
covariance data and for tracking the effects on the output of changes in the u* threshold (one of
the Selection Criteria). As in Fig. 2A, datasets are indicated by rectangles and processes by
ovals. Arrows (edges) indicate the direction of flow. The diamond indicates that the Selected
Data are used in both the Create Interpolation Model process and the Merge Data process.
Creating this DFG is the first step in generating process metadata for the analysis of the eddy
covariance data.
Figure 4. A data flow graph (DFG) that incorporates an iterative loop into the process used for
the analysis of the eddy covariance data and for tracking the effects on the output of changes in
the u* threshold (one of the Selection Criteria). As in Fig. 2A and Fig. 3, datasets are indicated
by rectangles and analytical processes by ovals. Arrows (edges) indicate the direction of flow.
Diamonds indicate points where datasets are used multiple times or where decisions about
processes must be made.
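The iterative loop in this figure amounts to rerunning the selection step for each candidate u* threshold and recording the outcome of each pass. The sketch below is a hypothetical illustration under assumed threshold values and toy data; the real analysis operates on the eddy covariance datasets, not on these placeholder rows.

```python
# Illustrative sketch of the iterative loop of Fig. 4: apply the
# Selection Criteria for several u* thresholds and keep one record
# per iteration. Rows and thresholds here are placeholders.

def apply_selection_criteria(rows, u_star_threshold):
    """Keep rows whose friction velocity meets the threshold."""
    return [r for r in rows if r["u_star"] >= u_star_threshold]

rows = [
    {"u_star": 0.1, "co2_flux": 1.0},
    {"u_star": 0.3, "co2_flux": 2.0},
    {"u_star": 0.5, "co2_flux": 3.0},
]

runs = []  # one entry per loop iteration
for threshold in (0.2, 0.4):
    selected = apply_selection_criteria(rows, threshold)
    runs.append({"u_star_threshold": threshold, "n_selected": len(selected)})

print(runs)
# → [{'u_star_threshold': 0.2, 'n_selected': 2},
#    {'u_star_threshold': 0.4, 'n_selected': 1}]
```

Recording each iteration this way is what allows the effect of the threshold choice on downstream estimates to be traced rather than reconstructed after the fact.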
Figure 5. A data derivation graph (DDG) specifying the instance (or result) of a single
application of the DFG shown in Fig. 4. The specific datasets used (e.g., Selected Data 1) are
indicated by clipped boxes, and the specific processes used (e.g., application of the first
Interpolation Model) are indicated by clipped ovals. Arrows (edges) indicate from which
previous datasets a given dataset was derived. The dotted lines (annotations) indicate which
process was applied to which sets of data.
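The audit-trail value of a DDG such as this one is that any derived dataset can be traced back through its annotated edges to everything it depends on. A minimal sketch of that backward walk, using dataset and process names modeled loosely on this figure (the exact records are assumptions for illustration):

```python
# Hypothetical sketch: given DDG derivation records, recover every
# dataset a given output was (transitively) derived from.

def ancestors(derivations, target):
    """Return the set of datasets from which `target` was derived."""
    found = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for inputs, _process, output in derivations:
            if output == current:
                for dataset in inputs:
                    if dataset not in found:
                        found.add(dataset)
                        frontier.append(dataset)
    return found

# DDG fragment modeled on Fig. 5 (names illustrative)
ddg = [
    (("Raw Data",), "Apply Selection Criteria", "Selected Data 1"),
    (("Selected Data 1",), "Create Interpolation Model", "Interpolation Model 1"),
    (("Selected Data 1", "Interpolation Model 1"), "Merge Data", "Filled Data 1"),
]

print(sorted(ancestors(ddg, "Filled Data 1")))
# → ['Interpolation Model 1', 'Raw Data', 'Selected Data 1']
```

The same traversal run forward (matching on inputs rather than outputs) identifies every derived product that would need regeneration if a raw dataset were revised.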
Figure 6. A stacked view of a DFG and DDG that illustrates different instances using stacked
icons. The DFG is at the bottom of the stack (grey boxes and ovals), and the instances (DDGs)
are placed on top of the DFG (white boxes and ovals). This example combines the DFG shown
in Fig. 4 with the DDG shown in Fig. 5.
Figure 7. A process definition graph (PDG) using Little-JIL notation for the analytic web
describing the analysis of eddy covariance data. Each step in the process is illustrated with a Step
Bar (rectangle) surrounded by a set of icons. Open triangles to the right and left of each Step
Bar indicate that there are no pre- or post-condition requirements for that step, whereas the
closed triangle to the right of the Create Interpolation Model Step Bar indicates a pre-condition
requirement (here, that the sample size is sufficiently large to create the model). Icons
within Step Bars specify the ordering of the process. The arrows in Estimate Total Nighttime
Carbon Flux, Create Row-Filled Data, Apply Selection Criteria, and Evaluate and Revise
specify sequential (row-by-row) processing. The X in Create Row-Filled Data specifies an
exception handler, and the connected icon (the circle with return-arrow) symbolizes the process
to be carried out to handle the exception. The asterisk above Evaluate and Revise indicates that
zero or more instances of this step may require execution. The circles are handles to which
specifications of the resources needed to support execution of the step can be attached.
Execution proceeds from bottom right to top left.