A. M. Ellison et al. - 1
AN ANALYTIC WEB TO SUPPORT THE ANALYSIS AND SYNTHESIS OF ECOLOGICAL DATA
Aaron M. Ellison,1,3 Leon J. Osterweil,2 Julian L. Hadley,1 Alexander Wise,2 Emery Boose,1
Lori Clarke,2 David R. Foster,1 Allen Hanson,2 David Jensen,2
Paul Kuzeja,1 Edward Riseman,2 and Howard Schultz2
1Harvard University, Harvard Forest, P.O. Box 68, Petersham, Massachusetts 01366, USA
2University of Massachusetts, Department of Computer Science, Computer Science Building,
Amherst, Massachusetts 01003, USA
3 Author for correspondence: Aaron M. [email protected]; phone: 978-724-3302; fax: 978-724-3595
For submission to: Ecology
Type of submission: Concepts & Syntheses
Number of words: 10,050
Number of tables: 2
Number of figures: 7
Date of submission: 8 September 2004
Abstract. Synthesis of ecological data is a major vehicle for addressing questions that span a
wide range of spatial and temporal scales. Useful syntheses should be reliable, reproducible, and
repeatable. Such synthesis requires access to data, accurate and precise descriptions of those data
(metadata), and documentation of the analytical processes used to derive usable datasets from
collections of raw data that may be massive and may include real-time data streams. We describe
a formal representation – an analytic web – that provides clear, complete, and precise definitions
of scientific processes used to process raw and derived data. The formalisms used to define
analytic webs are adaptations of those used in software development, and they provide a novel
and effective support system for the synthesis of ecological data. We illustrate the utility of an
analytic web through the analysis and synthesis of long-term measurements of whole-ecosystem
carbon exchange, and in the construction of a complete, internet-accessible audit trail of the
analytic processes used to determine the effect of varying model parameters on estimation of
nighttime carbon flux. The analytic web not only increased the ease and speed of processing and
synthesizing ecological data but also provided a great leap forward in communicating the results
of our efforts. Our software toolset, SciWalker, can be used to create a variety of analytic
webs, including webs of webs, tailored to specific data synthesis activities. The process metadata
created by SciWalker is readily adapted for inclusion in Ecological Metadata Language
(EML) files.
Key words: Analytic web; eddy covariance; EML; metadata; synthesis; SciWalker; XML
INTRODUCTION
Examining complex questions, integrating information from a variety of disciplines, and
testing hypotheses at multiple spatial and temporal scales account for an increasing proportion of
ecological and environmental research (e.g., Michener et al. 2001, Andelman et al. 2004). Such
syntheses can help to identify and address the “big” ecological questions (Lubchenco et al. 1991,
Belovsky et al. 2004) and contribute to the setting of local, regional, national, and global
environmental policies (Schemske et al. 1994, IPCC 2001, Kareiva 2002). Synthesis is the
raison d’etre of the National Center for Ecological Analysis and Synthesis (NCEAS); barely a
month passes without the appearance of one or more publications by NCEAS working groups.
At its heart, synthesis involves intellectual creativity – asking crosscutting questions and
confronting existing paradigms from new standpoints. But without interpretable and well-
documented data (Michener et al. 1997, Jones et al. 2001), meaningful synthesis cannot proceed
(Andelman et al. 2004). Fortunately, efforts are underway to develop data documentation
(metadata) tools (Michener 2000, Helly et al. 2002) that are intended to increase the chances that
publicly available datasets are used appropriately.
These are pressing issues. Funding agencies increasingly mandate that datasets obtained
with public funds be made available via the Internet with few restrictions. In general, it is up to
individual investigators to meet federal laws and the directives of funding agencies. The
Ecological Society of America makes allowances for researchers to archive data associated with
papers published in its journals1, but has no codified policy on permanent data accessibility or
1 Ecological Archives: http://esapubs.org/Archive/
archiving. In contrast, the Long-Term Ecological Research (LTER) network insists that the vast
majority of data collected at LTER sites be made available within three years of collection.2
Ecologists not only create synthetic datasets from existing datasets, but also increasingly
rely on real-time or near real-time data streams from automated sources. Examples include
climatic data, satellite imagery, measurements of energy, nutrient or gas fluxes, gene sequences,
and data expected from the nascent National Ecological Observatory Network (NEON). Both
synthesized data compilations and real-time data streams generate enormous datasets (Table 1)
that present novel challenges for visualization, interpretation, and analysis (see, for example,
Wegman 1995, Buja et al. 1996, Billard and Diday 2003). These challenges are posed not so
much by storage needs – the current generation of desktop computers can store gigabytes or
terabytes of data – as by difficulties arising from the software domain. The most obvious
challenges are those imposed by the costs of processing, visualizing, transporting, and analyzing
the data, many of which scale polynomially or exponentially with dataset size (Wegman 1995).
A less obvious but perhaps more important challenge is to assure that the analytical
processing of the data is sufficiently well documented that subsequent users of processed data:
(1) do not misuse or misinterpret the data; (2) can repeat the analyses with the same or
alternative datasets; or (3) can knowingly apply variants of the processes. Increasingly, data are
readily available to users worldwide, and the risk that such data or their analyses will be misused
is growing rapidly. It is a potential boon to science that collaborations among geographically
distributed researchers can be facilitated by easy access to scientific datasets, and this boon
creates unprecedented opportunities for participation in the active conduct of science by large
2 http://www.lternet.edu/data/netpolicy.html
groups of individuals and communities (e.g., the Globus3 and GriPhyN4 projects). Indeed,
distributed data processing among hundreds or thousands of computers linked by the Internet has
significantly accelerated analysis of very large datasets (Foster and Kesselman 1999, Reinefeld
and Schintke 2002, Aloisio and Cafaro 2003, Krishnan 2004, Li et al. 2004), leading to faster
and better scientific investigations.
Such expedited access to data creates opportunities for comprehensive analysis and
synthesis. However, it also increases uncertainties in the analysis (Regan et al. 2002) that
highlight three main concerns that we are addressing with the work described here. These
concerns are (1) the reliability of scientific datasets and their associated results; (2) the
reproducibility of scientific results; and (3) the repeatability of these results. While we focus here
on analysis of and documentation of processes applied to large datasets, we emphasize that the
techniques we describe apply to datasets of any size.
WHAT, WE WORRY?
It is important to recognize that the datasets on which scientific publications are based
often are far removed from the raw data collected in the laboratory or in the field. In general,
these datasets are derived from a sequence of scientific processes, including: sampling; data
checking and cleaning (quality assurance/quality control, or QA/QC); variable transformations;
statistical model construction; and statistical inference and evaluation (Gotelli and Ellison 2004).
Thus, our first concern is that highly processed datasets and the results derived from them may
3 http://www.globus.org
4 http://www.griphyn.org
be unreliable because incorrect or misleading conclusions were drawn from mishandled or
misinterpreted datasets.
Even when metadata are available, complete details of the actual processes, including
software used in the analyses and specifics of the analytical routines and algorithms used,
generally are not available or are poorly documented. Scientists often carry out complex sets of
processes in creating usable or publishable data. But if other scientists cannot determine
definitively the ways in which these processes were executed, there is a strong possibility that
subsequent use of the data will result in uncertain, incorrect, or misleading conclusions (Esser et
al. 2000). The problem of dealing with inappropriate use of data and scientific processes is
challenging even within a team of collaborating scientists in a single laboratory. As wide access
to scientific data increases, the risk of incorrect or misleading conclusions drawn from them will
increase. For example, historical maps are routinely digitized (e.g., Hall et al. 2002), and the
resulting computer-processed (GIS) maps may look so accurate and believable that they are used
uncritically despite the existence of metadata that include accuracy assessments and disclaimers.
Similarly, the uncertainty inherent in ecological forecasting, including predictions of climate
change and resource depletion, allows for a wide range of mutually incompatible conclusions
(IPCC 2001, Lomborg 2001, Bailey 2002), whereas policy decisions depend upon identifying the
correct conclusion.
Secondly, we are concerned that it will become increasingly difficult to reproduce results
if the processes leading to these results are unavailable or poorly documented. Reproducibility is
a fundamental requirement of all science; phenomena observed by one scientist must also be
observable by others. Reproducing results is greatly complicated when the phenomenon of
interest is manifest only through datasets that have been combined and processed through
complex sequences of computer-based tools and procedures. Available datasets must be
accompanied by metadata (Michener et al. 1997) sufficient to allow the data to be distinguished
from other datasets with which they might be confused and to permit data files to be correctly
read and interpreted (Andelman et al. 2004). For example, sediment cores are used routinely to
reconstruct past climates. Some lakes, such as Elk Lake, Minnesota, have been cored repeatedly
by different teams of investigators (Bradbury and Dean 1993, Smith et al. 1997), resulting in a
comprehensive dataset that can be profitably synthesized because comprehensive, standardized
metadata are available.5
The final concern is that the results and analysis may not be repeatable, even by the
original group of scientists that collected and analyzed the data. Repeatability requires that the
process initially used to generate published results, including the tools and algorithms employed
by that process, must be available or well documented. Modifications to any of these processes
may be inadvertent, as when a software package is updated or the underlying operating systems
are modified. Scientists unaware of these modifications may proceed with synthesis under the
incorrect assumption that the original process is being used. If changes have been made, then
the original scientific process may not be repeated, leading either to different results or to the
false conclusion that confirmation of prior results has occurred.
Our view is that all three concerns – reliability, reproducibility, and repeatability – can be
significantly addressed through the use of clear, comprehensive, and accurate documentation of
scientific processes, using formal process definition technologies. Such documentation will
ensure that the data and processes are usable by scientists in ways that facilitate the reliability,
reproducibility, and repeatability of scientific results. Our proposed solution builds upon
5 World Data Center for Palaeoclimatology: http://www.ngdc.noaa.gov/paleo/paleo.html
advances in electronic communication and computer science technology to ease and clarify,
rather than impede and complicate, scientific discovery.
We refer to our solution as an analytic web. An analytic web consists of collections of
representations (in our case, annotated graphs) describing the scientific processes employed, the
datasets used and derived, and the relationships among them. These representations may include
links to physical objects (e.g., URLs to reference data or to other analytic webs) and, thus,
describe graphs of graphs or “webs” of data and process descriptions. To help scientists create,
use, and archive analytic webs, we are developing a toolset, called SciWalker, to support the
definition, execution, evaluation, and reuse of analytic webs. Before describing an analytic web
in detail, we illustrate the need for it with a motivating example.
AN EXAMPLE ILLUSTRATES THE NEED FOR AN ANALYTIC WEB
The increasing atmospheric concentration of carbon dioxide (CO2) and its relationship to
global temperature are well known (IPCC 2001). General circulation models (Cramer et al. 2004,
Meehl et al. 2004) and direct measurements of carbon exchange between ecosystems and the
atmosphere (Barford et al. 2001) provide data that are used to: estimate whether different
ecosystems are sources or sinks for CO2; predict how changes to these ecosystems will alter
atmospheric CO2 levels; and forecast how these estimates and predictions will change under
different climate change scenarios (Wigley and Raper 1992, IPCC 2001). Both direct
measurement of carbon exchange and synthetic modeling of whole ecosystems are relatively
new, complex tasks that require large amounts of data from diverse sources. These datasets are
processed with a range of statistical and mathematical models, and the models can provide
continuous estimates of CO2 exchange rate in near real-time. Data posted to the World Wide
Web following publication are highly processed,6 and the reproducibility and repeatability of
these processes contribute to the reliability of these estimates, predictions, and forecasts.
Measurements of ecosystem CO2 flux
Long-term measurements of carbon exchange of whole ecosystems began in the mid-
1980s to early 1990s and primarily have used the eddy covariance method (Baldocchi et al.
1988, Wofsy et al. 1993, Hollinger et al. 1994, Barford et al. 2001). Briefly, this method
estimates CO2 flux from the covariance of CO2 concentration and vertical wind velocity
(Baldocchi et al. 1988). The estimates are accurate if three conditions are satisfied:
1. The concentration of CO2 in air moving horizontally across the surface must be relatively
constant in time. Any change in its concentration due to horizontal air movement must be
adjusted for.
2. All vertical air motions must be accounted for. Since air can move in very small vertical
eddies, which complete a circular motion in a few seconds, vertical velocity and the
concentration of CO2 must be measured at least several times per second.
3. Air parcels near the surface must reach the measurement point by vertical motion
(convection) more rapidly than they leave the area by horizontal motion (advection).
Barriers between the ground surface and the measurement point may exist, for example a
forest canopy between the ground and a measurement point above the canopy. These
barriers must be overcome by a combination of sufficient turbulence between the
6 e.g., Worldwide Fluxnet network: http://daac.ornl.gov/FLUXNET/
measurement point and the surface, and sufficient buoyancy in air parcels near the
surface.
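The core computation behind the eddy covariance method can be sketched in a few lines: the flux estimate over one averaging interval is the covariance of vertical wind velocity and CO2 mixing ratio. This is a minimal illustration of that single step, not the full processing chain, which also detrends against running means and applies corrections omitted here; the function names are ours.

```python
# Minimal sketch of the eddy covariance flux estimate: CO2 flux is taken as
# the covariance of vertical wind velocity (w) and CO2 mixing ratio (c),
# both sampled at high frequency (e.g., 5 Hz). Detrending and coordinate
# corrections used in real processing are omitted.

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def eddy_flux(w, c):
    """Estimate CO2 flux over one averaging interval from w and c samples."""
    return covariance(w, c)
```

When upward-moving air parcels carry systematically higher CO2 concentrations, the covariance (and hence the estimated flux) is positive; uncorrelated fluctuations yield a flux near zero.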
During daylight hours, forest canopies rarely present a serious barrier to accurate eddy
covariance measurements, because solar heating of the ground surface heats the air near the
surface, creating both buoyancy and rising air, or vertical convection. At night, however,
convection is much weaker and the ground may become as cold as or colder than the air above it,
creating a layer of stable or even sinking air above the surface. Due to these problems, accurate
eddy covariance measurements may be unavailable for 40% to 75% of nighttime hours,
depending on the location (Wofsy et al. 1993, Goulden et al. 1996, Barford et al. 2001, Saleska
et al. 2003, Hollinger et al. 2004). Eddy covariance data also may be unreliable during daytime
periods with weak turbulence and low buoyancy of air near the ground – these are particularly
common near dawn and dusk. Lastly, data may be lost during equipment calibrations,
maintenance or malfunctions. Processes for filling in (interpolating) these data gaps must
produce unbiased estimates that are as accurate as possible. In publications with major policy
implications (Wofsy et al. 1993, Goulden et al. 1996, Barford et al. 2001, Saleska et al. 2003),
highly processed data, including the interpolated data, are often reduced to a single graph and
only the processed data are publicly available.7
Currently, geographically and temporally extensive eddy covariance data are being
synthesized to assess the impact of many different physiographic and climatic variables on
carbon storage and carbon flux. These large-scale estimates of carbon storage and carbon flux
are used by decision-makers to set policies regarding emissions, carbon sequestration, and
carbon credits and offsets (Brown 2002, Saunders et al. 2002, Bettelheim and D'Origny 2002).
7 For example, see http://www-as.harvard.edu/data/nigec-data.html
However, myriad investigators collect these large datasets, and gap-filling processes differ
among investigators. Consistent processes for data acquisition and manipulation must be used if
aggregations or syntheses of the data are to identify real effects and make meaningful
predictions. Differences among datasets resulting from variation in data collection and
manipulation processes can impede reliable forecasts, especially if these differences are
unknown or unaccounted for. For example, eddy covariance measurements in the Amazon basin
in the 1990s were used to conclude that Amazon forests were storing 1 to 5 Mg C ha⁻¹ yr⁻¹
(Grace et al. 1995, Grace et al. 1996, Malhi et al. 1998, Carswell et al. 2002). More recently
Saleska et al. (2003) found an annual net carbon loss of between 0 and 2 Mg C ha⁻¹ yr⁻¹ from
two Amazonian forests and concluded that earlier estimates of substantial C storage may have
been caused by inclusion of data from nighttime periods when low turbulence was reducing the
measured carbon efflux. The result may have been an underestimate of nighttime C release and
an overestimate of annual C storage. If an accurate and easily understandable description of the
processes that were actually used had been available, the source of this discrepancy would have
been easier to determine, and the discrepancy itself might even have been avoided.
A prose description of the data-acquisition process
We measure east-west (u), north-south (v), and vertical (w) wind velocity components at
5 Hz with a sonic anemometer (Campbell Scientific, Logan, Utah) mounted on a tower extending
approximately 5 m above the tree canopy (Hadley and Schedlbauer 2002). Water vapor (q) and
CO2 (c) mixing ratios at the location of the sonic anemometer are simultaneously measured at 5
Hz by a closed path infrared gas analyzer (Licor Inc., Lincoln, Nebraska). A laptop computer
collects both data streams and computes ten-minute running means of each of these six tower
variables. Every 30 minutes, the computer calculates mean values of u, v, w, q, and c and the
covariance of u, v, q, and c, with deviations from the running mean of w. A data logger at the
same site measures environmental characteristics that have significant effects, including
photosynthetically active radiation (PAR), air temperature, soil temperature, atmospheric
humidity and soil moisture. These environmental variables are measured every 30 seconds and
averaged every 30 minutes.
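The aggregation of the environmental variables can be sketched as block averaging: sixty 30-second readings make one 30-minute mean. The sampling counts below follow the text; the function name is ours.

```python
READINGS_PER_HALF_HOUR = 60  # environmental variables are logged every 30 s

def half_hour_means(readings):
    """Average consecutive blocks of sixty 30-second readings into
    30-minute means. Any incomplete trailing block is dropped."""
    n_blocks = len(readings) // READINGS_PER_HALF_HOUR
    return [
        sum(readings[i * READINGS_PER_HALF_HOUR:(i + 1) * READINGS_PER_HALF_HOUR])
        / READINGS_PER_HALF_HOUR
        for i in range(n_blocks)
    ]
```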
Before the flux of CO2 can be estimated, the 30-minute means and covariances of the
tower variables are checked by the investigator to determine which observations are associated
with atmospheric conditions under which measured CO2 flux is not limited by turbulence or
weak vertical air movement. This is most easily done for data collected at night when turbulence
is low and the probability of inadequate vertical air motion is greatest; for illustrative purposes,
we describe the process only for nighttime CO2 flux data.
First, we discard observed values of CO2 flux if the wind direction is unsuitable for flux
measurements. A particular wind direction may be unsuitable because local topography in a
given direction creates unpredictable turbulence patterns or because the forest of interest does
not occur over a sufficient fetch in that direction.
Second, we examine the relationship between friction velocity, u* (a measure of
turbulence [in m/s], which equals the square root of vertical momentum flux) and CO2 flux for
several weeks of nighttime measurements. Flux is plotted against u* (Fig. 1), and a threshold
value of u* (u*threshold) is identified beyond which CO2 flux does not increase significantly.
Observed values of CO2 flux are discarded if u* < u*threshold. If data from all wind directions are
suitable, the u*threshold criterion typically results in the discarding of <50% of the nighttime
observations of CO2 flux. On the other hand, if some wind directions are unsuitable, >75% of the
nighttime observations may be rejected.
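The two screening steps above can be combined into a single per-observation filter. The threshold and wind sectors below are purely illustrative assumptions: in practice u*threshold is read off a plot like Fig. 1, and the acceptable sectors depend on site topography and fetch.

```python
U_STAR_THRESHOLD = 0.2  # m/s; illustrative value, derived in practice from Fig. 1
GOOD_WIND_SECTORS = [(0.0, 90.0), (180.0, 360.0)]  # degrees; hypothetical sectors

def keep_flux_observation(wind_direction, u_star):
    """Retain a nighttime CO2 flux observation only if the wind direction is
    suitable and friction velocity (u*) meets the turbulence threshold."""
    suitable_direction = any(lo <= wind_direction < hi for lo, hi in GOOD_WIND_SECTORS)
    return suitable_direction and u_star >= U_STAR_THRESHOLD
```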
Next, we fill the gaps in the dataset that result from discarding observed values of CO2
flux by estimating the values that would have been observed if u* ≥ u*threshold. To fill these gaps,
we fit regression models of the reliable observations (CO2 flux | u* ≥ u*threshold) to the measured
environmental variables. For nighttime observations, the predictor variables (identified using
stepwise multiple regression) are soil temperature, air temperature and occasionally soil
moisture, which can have large effects on soil respiration when soil temperatures are relatively
warm (spring through fall; see Savage and Davidson 2001).
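The gap-filling step amounts to fitting a regression on the retained observations and predicting the discarded ones. For brevity, this sketch uses a single predictor (soil temperature) and ordinary least squares rather than the stepwise multiple regression described above; names are ours.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fill_gaps(flux, soil_temp):
    """flux: CO2 flux series with None where observations were discarded
    (u* < u*threshold); soil_temp: co-measured predictor series.
    Fits on retained observations, then predicts the gaps."""
    kept = [(t, f) for f, t in zip(flux, soil_temp) if f is not None]
    a, b = fit_line([t for t, _ in kept], [f for _, f in kept])
    return [f if f is not None else a + b * t for f, t in zip(flux, soil_temp)]
```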
AN ANALYTIC WEB DEFINED
Although this prose description summarizes the scientific processes used to process the
raw data, such prose descriptions lack the precision and completeness needed to ensure
reliability, reproducibility, and repeatability. These goals are far more achievable through
formalisms developed for engineering of computer software. These formalisms can be used to
create analytic webs: clear, comprehensive, and precise definitions of the scientific processes
used to process raw and derived data. More generally, an analytic web is a coordinated set of
formal representations of key aspects of a scientific process, which together provide a clear,
precise, and comprehensive definition of the process. We suggest that such analytic webs would
be effective in supporting the synthesis of ecological data.
AREN’T THE METADATA SUFFICIENT?
Metadata comprise “the higher level information or instructions that describe the content,
context, quality, structure, and accessibility of a specific data set” (Michener et al. 1997). In
recent years there have been remarkable advances in the development of standards and tools for
creating and using ecological metadata (Andelman et al. 2004). For ecologists, perhaps the most
important of these has been the creation of a standard for structured metadata, Ecological
Metadata Language (EML). EML is a modular and flexible application of Extensible Markup
Language (XML)8 developed by the ecological community and designed to support data
discovery, access, integration, and synthesis.9 A sample EML file for the eddy flux experiment
described above can be viewed on the Harvard Forest web page.10
EML facilitates data discovery by encoding general information about datasets, such as
creator, title, abstract, keywords, and geographic, temporal, and taxonomic coverage. Access to
data is facilitated through specific information on the location (e.g., its uniform resource locator
[URL] or digital object identifier [DOI]) and the physical characteristics (e.g., file name, coding,
record delimiters, and field delimiters) of data files. Integration and synthesis are facilitated
through detailed information on dataset content (e.g., name, unit, precision, range, and number
type for each variable in a table). Retrieval of specific items of information is facilitated by the
structured nature of EML files, which are interpretable and searchable (parsable) by computers.
EML provides two modules for the documentation of methods used in the creation of a
dataset. The protocol module is used to describe an established field, laboratory, or analytical
8 http://www.w3.org/XML/
9 http://knb.ecoinformatics.org/software/eml
10 http://harvardforest.fas.harvard.edu, dataset HF072
procedure – essentially a standardized method. The methods module is used to describe the field,
laboratory, and analytical procedures that were actually used in the creation of a particular
dataset, and may refer to one or more relevant protocol modules, if available. The protocol
module is prescriptive, while the methods module is descriptive. Both modules have provisions
for specifying individual steps and associated citations, instrumentation, software, and quality
control, but are otherwise essentially narrative in content. The prose description above of the
process used to clean, model, and gap-fill the eddy covariance data and estimate CO2 flux is
suitable for inclusion in the methods module of an EML file. Individual steps in a process can be
expressed in EML through objects called methodStep elements.
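As a sketch of what such sequential-narrative process metadata looks like, the fragment below assembles a methods element with one methodStep per processing step. The element names follow EML conventions, but this is not a complete or valid EML document: required structure and namespace declarations are omitted, and the helper function name is ours.

```python
import xml.etree.ElementTree as ET

def methods_fragment(step_descriptions):
    """Build a minimal EML-style <methods> fragment, one methodStep per step."""
    methods = ET.Element("methods")
    for text in step_descriptions:
        step = ET.SubElement(methods, "methodStep")
        description = ET.SubElement(step, "description")
        para = ET.SubElement(description, "para")
        para.text = text
    return ET.tostring(methods, encoding="unicode")

# The nighttime flux processing steps described earlier, as narrative steps:
fragment = methods_fragment([
    "Discard observations from unsuitable wind directions.",
    "Discard observations with friction velocity below u*threshold.",
    "Gap-fill discarded values by regression on environmental variables.",
])
```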
The current version of EML (2.0.0) thus is able to capture process metadata in narrative
or sequential-narrative form. However, this may not be sufficient to allow the processed data to
be reproduced from the original data if the analytical steps involved are complex or if they use
structures such as conditionals or iteration. If not precisely stated, such complexity can create
ambiguity. In addition, some combinations of data processing techniques may lead to statistical
or logical errors. For example, further smoothing of data that have already been smoothed may
suppress important variation. Rounding errors and transcription errors may also result in the
publication of incorrect or misleading results (Garcia-Berthou and Alcaraz 2004).
We propose that every dataset generated by a research project should have attached to it
process metadata that formally describes the processes by which the data were derived, including
the sequence of tools, techniques, and intermediate datasets used. Such metadata would be a
useful complement to existing metadata standards such as EML, and in fact could be
incorporated directly into EML files. Our research has convinced us that such an approach could
have substantial value in helping scientists understand and share datasets more safely and
effectively. While many scientists currently describe their analytical processes only informally
and partially, there is growing interest in using formal notations, such as dataflow and workflow
graphs, to document these processes (e.g., Ailamaki et al. 1998, Altintas et al. 2004a, 2004b).
Our research expands on these efforts in an attempt to use more expressive formalisms to capture
the subtleties and complexities of, and to remove the potential ambiguities in, the descriptions of
the processes used by ecologists and other scientists.
FORMAL DESCRIPTION OF AN ANALYTIC WEB
An analytic web is a coordinated collection of three types of graphs – a dataflow graph, a
dataset derivation graph, and a process definition graph – all of which were originally developed
for use in defining and controlling software development projects (e.g., Ghezzi et al. 2003). We
propose to use these three graphs to formalize different aspects of ecological data processing in
such a way as to assure its reliability, reproducibility, and repeatability.
A dataflow graph (DFG) clearly defines type information: which types of datasets are
acted upon by which types of processes (tools, activities) in order to produce other types of
datasets. A DFG documents the relationships among dataset types and process types. For
example, a process type of “interpolate via linear regression” could be applied to a dataset type
of “rainfall data” and the DFG would indicate the conditions under which this interpolation
would be applied. Like a recipe in a cookbook, the clarity and comprehensibility of a DFG
facilitates the reproduction of scientific processes.
The type information in a DFG is used to assure that the appropriate kinds of processes
are being applied to the appropriate kinds of data. For example, the process type could specify
the exact kind of linear regression algorithm or the particular statistical package to be applied.
Similarly, the dataset type could indicate the type of data, as noted above, but could also go
further and indicate the precise format (syntax) of each field or the access methods (semantics)
that must be associated with the data, as found in object-oriented type definitions. The type
information is used for checking type conformance; it should be possible to incorporate
automatically the structural transformations used to transform datasets of one type to another
(Bowers and Ludäscher 2004).
In contrast to the type relationships represented in a DFG, a dataset derivation graph
(DDG) documents the specific instances of datasets produced by the actions of specific
processes operating on other specific datasets. Dataset instances documented by DDGs are
uniquely specified and are the usual focus of attention in reproducing and repeating scientific
processes. Examples include “rainfall data collected at the Harvard Forest on 1 June 2004 at
hourly intervals”, or “SAS version 6.1 of the non-linear regression using nls2 (Huet et al.
2004)”. Whereas a DFG is the recipe found in the 10th edition of the Joy of Cooking, the DDG is
the result: a spiced chocolate prune cake baked for Jane’s 60th birthday. Reproducibility and
repeatability demand the documentation provided by the DDG, namely the specific datasets and
tools that were actually used.
For simple scientific processes, a DFG and DDG together may be sufficient. For more
complex processing, especially processes where exceptional conditions might arise that require
special attention, a richer notation is more effective. For such cases, a process definition graph
(PDG) is used to represent complex procedural detail. The PDG emphasizes the specification of
the full range of possible activities that might be taken, such as those that happen in response to
unexpected or undesired occurrences. Continuing the cookbook analogy, a PDG could include a
specification of what to do in case the egg yolks curdle while beating up an egg-based sauce, or
steps that might be taken in case a particular specified type of ingredient is either unavailable, or
is of insufficient quality. The PDGs associated with a DFG form the basis for reasoning about
the possibility of applying incompatible statistical methods, or using inconsistent or incompatible
datasets with one another.
We describe the representations of these three types of graphs and illustrate their use in
processing the eddy covariance data to yield an estimate of nighttime carbon flux.
Dataflow graphs
Dataflow graphs (DFGs) specify the precise ways in which the various tools and
subprocesses are used to derive new datasets. In a DFG (Fig. 2a), the nodes represent the various
tools, activities or datasets and the edges represent the flow of datasets into and out of these
processes. A DFG supports the definition of additional process details such as parallelism and
iteration. The DFG defined here uses different icons to differentiate processes (ovals in Fig. 2a)
from datasets (boxes in Fig. 2a).
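The structural rule of a DFG, that edges alternate between dataset types and processes, can be sketched in a few lines of Python. This is an illustration only, not SciWalker code; the class and node names are our assumptions.

```python
from dataclasses import dataclass, field

# A minimal sketch of a dataflow graph (DFG): nodes are either dataset
# types (boxes in Fig. 2a) or processes (ovals); edges carry datasets
# into and out of processes.

@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "dataset" or "process"

@dataclass
class DFG:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, target) pairs

    def add_edge(self, src: Node, dst: Node) -> None:
        # An edge must link a dataset type to a process, or vice versa.
        assert src.kind != dst.kind, "edges connect datasets and processes"
        self.nodes.update({src, dst})
        self.edges.add((src, dst))

# Hypothetical fragment of the eddy covariance DFG.
tower = Node("Tower Data", "dataset")
aggregate = Node("Create Aggregated Data", "process")
aggregated = Node("Aggregated Data", "dataset")

dfg = DFG()
dfg.add_edge(tower, aggregate)
dfg.add_edge(aggregate, aggregated)
```

The assertion in `add_edge` enforces the bipartite discipline that distinguishes a DFG from an arbitrary directed graph.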
Data derivation graphs
A data derivation graph (DDG; Fig. 2b) keeps track of specific datasets that have been
derived by the actions of tools operating on (other) specific datasets. In a DDG, a node (clipped
box) represents each actual dataset instance created by executing a process. Each node is
connected by an edge (arrow) to the dataset(s) from which it was derived. The edge is annotated
(clipped oval) with a specification of the specific tool instance (e.g., the exact version of a
statistical routine) or sub-process instance (e.g., a reference to another DFG or PDG) used for the
derivation.
A DDG never has cycles, but grows as more process steps are executed and datasets are
created. The most recent instance of a dataset is appended to the DDG by an edge (or edges),
representing how it was derived from previously derived datasets. No datasets are ever destroyed
or supplanted in the DDG. They are simply created and left available for subsequent use.
A DFG describes how types of datasets can be created, whereas the DDG describes how
specific instances were actually derived. A DFG integrated with a DDG allows for the
identification of the sequence of processes and the datasets created by their application. For
example, although a DDG has no cycles, a DFG uses cycles to represent iteration – the
application of a specific sequence of processing steps to successive datasets. Thus, the DFG can
be viewed as a specification of a generator that builds ever-growing DDGs, creating an
interpretable audit trail of process executions. To ensure reproducibility, it is important to
preserve rather than overwrite datasets produced by such iterative processes.
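The append-only character of a DDG can be sketched as a list of derivation records, each naming its output, its inputs, and the tool used, from which a lineage (audit trail) can be walked backwards. This is a hypothetical Python sketch, not the SciWalker implementation; all names are illustrative.

```python
import datetime

# An append-only data derivation graph (DDG). Records are only ever
# appended, never overwritten, so every dataset instance remains
# traceable to the exact inputs and tool version that produced it.

class DDG:
    def __init__(self):
        self._records = []  # append-only derivation records

    def record(self, output, inputs, tool):
        self._records.append({
            "output": output,
            "inputs": tuple(inputs),
            "tool": tool,
            "when": datetime.datetime.now().isoformat(),
        })

    def lineage(self, output):
        # Walk backwards from a dataset instance to all of its ancestors.
        ancestors, frontier = [], [output]
        while frontier:
            name = frontier.pop()
            for r in self._records:
                if r["output"] == name:
                    ancestors.extend(r["inputs"])
                    frontier.extend(r["inputs"])
        return ancestors

ddg = DDG()
ddg.record("Selected Data 1",
           ["Aggregated Data 1", "Selection Criteria 1"],
           "Segregate Data v1.0")
ddg.record("Interpolation Model 1", ["Selected Data 1"],
           "Create Interpolation Model (R 1.9.0)")
```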
As noted, there is a relationship between the instances represented in a DDG for a process
execution and the DFGs that were executed to create those instances. We have found that it is
useful to provide a view of the process that combines these two representations by imposing a
representation of the created instances onto the DFGs for that process. Since there may be many
instances of a dataset associated with a node that represents the creation of a particular dataset
type, we provide an alternative view of a DFG overlain on a DDG that “stacks” icons
representing instances behind the icon representing the particular dataset type (see Fig. 6, below).
Process definition graphs
A process definition graph (PDG; Fig. 2c) provides a more detailed view of the process
than does a DFG. Although a DFG could be used to define any process, experience has shown
that this representation is not well suited for representing complex processes that require, for
example, conditional decisions and human intervention. Such a complex process could be
abstractly represented as a single activity node (oval) in a DFG, but then could be elaborated as a
PDG. The PDG describes procedural details such as the order in which steps must be taken,
where parallelism is possible, pre-condition and post-condition checks to determine whether or
not processing has been successful, actions to try when various exceptional conditions occur, and
conditions under which processing is to be iterated or terminated. In short, the PDG specifies
detailed procedural flow.
A PDG specification (Fig. 2c), drawn in the style of the process definition language
Little-JIL (Wise 1998), represents processes as decomposition hierarchies of steps. Each step is
represented by a cluster of icons either within or around a rectangle, called the Step Bar. The
triangles outside the Step Bar are iconic representations of optional pre- (left of the Step Bar) and
post-condition (right of the Step Bar) steps used to control and monitor the execution of the step.
When these triangles are not filled in, no pre- or post-conditions are specified, as in
this figure. The leftmost icon inside the Step Bar indicates the sequencing order of the sub-steps.
In Fig. 2c, the leftmost icon is an arrow, which indicates that this step is processed in sequential
order by executing its (hierarchical) sub-steps in order from left to right. There are also icons to
represent parallel execution and non-deterministic selection, among others. The sub-steps
associated with the rightmost icon (X) inside the Step Bar specify the behavior to be followed
if exceptional conditions arise.
In Figure 2c, the process step labeled Summarize Data is defined as the composition of
three substeps, Process Item, Compute Summary, and Fix Problem. Process Item and Compute
Summary are associated with normal process flow, which is specified by the sequence icon (e.g.,
an arrow, representing sequential left to right flow). Process Item has variable cardinality, as
represented by the asterisk on its incoming edge, indicating that zero or more instances of this
step will need to be executed, depending on the number of items available to be processed. Since
this is a sequential step, each instance of Process Item will be completed before the next one is
initiated and Compute Summary will only start after the last (and thus all) instances of Process
Item have completed execution. The sub-step Fix Problem is associated with exceptional process
flow, as indicated by the X icon in the Step Bar, and will only be invoked if the exception Fix
Problem is raised by any of the sub-steps associated with process step Summarize Data.
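The control flow of Summarize Data can be mimicked in ordinary Python: items are processed strictly in sequence, the summary runs only after the last item, and an exception raised by a sub-step is caught by the Fix Problem handler. This sketch only imitates the Little-JIL semantics described above; the step and function names are illustrative.

```python
# Sketch of the Summarize Data step of Fig. 2c. Zero or more Process
# Item instances run sequentially (the asterisk cardinality), Compute
# Summary runs after all of them, and a FixProblem exception raised by
# any sub-step is handled by the Fix Problem repair step.

class FixProblem(Exception):
    pass

def process_item(item):
    if item is None:            # stand-in for a problematic data item
        raise FixProblem(item)
    return item * 2

def fix_problem(item):
    return 0                    # repair action: substitute a default value

def summarize_data(items):
    processed = []
    for item in items:          # sequential: each instance completes
        try:                    # before the next one begins
            processed.append(process_item(item))
        except FixProblem:
            processed.append(fix_problem(item))
    return sum(processed)       # Compute Summary runs only at the end

result = summarize_data([1, 2, None, 4])
```

Note that the handler is attached to the parent step, as in the PDG: any sub-step's exception routes to the same repair logic.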
Although not shown in Figure 2c, the edges connecting a higher-level step to a lower-
level step may have associated annotations that define the datasets that flow back and forth
between steps, and a step may include a specification of the resources needed for its execution. A
principal resource to be specified for each step is the agent responsible for the step’s execution.
This agent may be hardware, software, and/or human. In all cases, it is necessary to detail the
nature of the agent and other resources needed.
Applying an analytic web to the eddy flux data
We illustrate the application of an analytic web through the example of examining the
effect of varying u*threshold on the estimate of nighttime carbon flux derived from eddy covariance
data. The data we used were collected in 2000 and 2001 in a hemlock forest in central
Massachusetts (Fig. 1; Hadley and Schedlbauer 2002). Because the tower was sited near the
northeast corner of a hemlock stand, only data collected when the wind blew from the southwest
(180–270° E of N) were usable for estimating carbon flux from the hemlock stand. As described
above (A prose description of the data-acquisition process), we therefore first discarded data
collected when the wind was blowing from any direction other than the southwest (i.e., 0–179° E
of N and 271–359° E of N). The application of the u*threshold criterion also resulted in discarding
some of the data even when the wind was from the southwest. We used a linear regression model
using data from independently measured environmental variables (e.g., soil temperature, soil
moisture) to fill the resulting gaps in the dataset (Hadley and Schedlbauer 2002). Estimated
carbon flux increased non-linearly with the value of u*threshold, and it appeared that C flux
approached an asymptote, whereas the fraction of data retained decreased linearly with u*threshold
(Fig. 1). This result is quite similar to that of Hollinger et al. (2004) who analyzed four years of
nighttime eddy covariance data in a coniferous forest in Maine, USA.
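The gap-filling step described above can be sketched as follows: fit an ordinary least-squares line of flux on soil temperature using the retained records, then predict flux where records were discarded. The numbers and variable names here are invented for illustration; the actual model of Hadley and Schedlbauer (2002) used several independently measured environmental variables and real data.

```python
# Sketch of regression-based gap filling. A simple linear model is fit
# to the observations that survived the wind-direction and u* filters,
# then used to estimate flux for the gaps those filters created.

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# (soil temperature, measured flux); None marks a filtered-out gap.
records = [(10.0, 2.0), (12.0, 2.4), (14.0, None), (16.0, 3.2)]

kept = [(t, f) for t, f in records if f is not None]
a, b = fit_line([t for t, _ in kept], [f for _, f in kept])

# Predict flux for the gaps; keep measured values where they exist.
filled = [f if f is not None else a + b * t for t, f in records]
```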
The dataflow graph (DFG) for creating usable datasets from the eddy covariance data
(Fig. 3) includes dataset types Tower Data, Environmental Data, Selection Criteria, Aggregated
Data, Excluded Data, Rejected Data, Selected Data, Interpolated Data, and Row-Filled Data. As
models and criteria are also data types, the Interpolation Model and the Selection Criteria
are likewise represented as boxes. Processes, represented by ovals, include Create Aggregated
Data, Segregate Data, Create Interpolation Model, Apply Interpolation Model, and Merge Data;
these represent the types of actions to be applied to particular types of data. Note that a
process may require more than one input; for example, Interpolation Model and Excluded Data
are both required for the process Apply Interpolation Model. A dataset may also be input to more
than one process. The diamond indicates that the same instance of data type Selected Data is
input to both the Create Interpolation Model and Merge Data processes.
Figure 4 illustrates a modest enhancement of the DFG shown in Fig. 3 in which the
criteria used to partition the data may be modified after examining the results of interpolating
and merging the data (the Revise Selection Criteria process uses the formerly final data type
Row-Filled Data). This iterated DFG (Fig. 4) specifies that new selection criteria will be created
based on examination of the Row-Filled Data and that these criteria may be applied to existing
datasets, generating new Row-Filled Data. The process indicates that this iteration might be
repeated, causing the successive generation of new criteria and new output data. Thus, this
process might produce a sequence of criteria and a sequence of datasets. It is vital that there be a
firm connection between each dataset and the model from which it was created.
The data derivation graph (DDG), shown in Figure 5, resulting from the execution of the
DFG shown in Figure 4, illustrates the connections between each instance of the dataset and the
process that created it. This DDG depicts the state of the execution of the DFG at the beginning
of the second iteration. As in Fig. 2, boxes with clipped corners in Fig. 5 denote specific dataset
instances. Each clipped box represents the dataset derived by applying the activity from which its
incoming edge emanates to the dataset(s) input to that activity.
Rather than depicting DDGs independently, the stacked view (Fig. 6) shows the
relationship between a dataset type in the DFG and each dataset instance created during
execution. This representation more clearly shows the specific dataset instances that have been
created in successive (iterative) applications of an analytic web.
Although the DDG and the stacked view provide a clearer view of how certain datasets
have been created, metadata are still needed to clarify the origins and particulars of the datasets
(particularly the Tower Data and the Environmental Data) and the Interpolation Model. Such
metadata are essential for the repeatability and reproducibility of the final results. EML readily
accommodates this kind of metadata.
Just as dataset metadata provide additional information about a dataset instance to
ensure reproducibility and repeatability, more detailed process information is also sometimes
needed to carefully delineate the actions and decisions associated with a process. We currently
support two ways of providing this detail. One is by decomposing any DFG activity node into a
more detailed DFG. This hierarchical decomposition is a good approach for modeling the
relationship between a more abstract concept and its more detailed representation. Such
decomposition can be repeated to any level of detail. Alternatively, the details associated with a
DFG activity node can be represented by a process definition graph (PDG).
A PDG of the DFG shown in Figure 4 illustrates this point (Fig. 7). Without delving
deeply into the syntax and semantics of Little-JIL (see Wise 1998 for a complete description),
there are two details of particular import in this diagram. First, the filled triangle on the left side
of the Step Bar for Create Interpolation Model indicates a guard or pre-condition on this step.
This pre-condition will assure that the dataset is of sufficient size to create the model. Second,
the exception handler (X) associated with the step Create Row-Filled Data specifies that if this
pre-condition fails, then the process step associated with this handler will intercede to change the
selection criteria. Without support for exception handling, it is more difficult to encode repair
activities.
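The guard-and-handler loop of Fig. 7 can be imitated in plain Python: a pre-condition on Create Interpolation Model checks that the selected dataset is large enough, and when the guard fails, the handler revises the selection criteria and the step is retried. The minimum size, the revision rule, and all names below are illustrative assumptions, not values from the paper.

```python
# Sketch of the PDG's pre-condition and exception handling. The guard
# on Create Interpolation Model rejects datasets that are too small;
# the handler on Create Row-Filled Data relaxes the selection criteria
# and iterates. (The sketch assumes relaxation eventually succeeds.)

MIN_POINTS = 3                          # hypothetical guard threshold

class PreconditionFailed(Exception):
    pass

def create_interpolation_model(selected):
    if len(selected) < MIN_POINTS:      # pre-condition (guard)
        raise PreconditionFailed(len(selected))
    return "model over %d points" % len(selected)

def revise_selection_criteria(threshold):
    return threshold - 0.05             # relax u* cutoff to retain more data

def create_row_filled_data(data, threshold):
    while True:
        selected = [u for u in data if u >= threshold]
        try:
            return create_interpolation_model(selected), threshold
        except PreconditionFailed:      # handler: change criteria, iterate
            threshold = revise_selection_criteria(threshold)

ustar = [0.12, 0.18, 0.22, 0.31, 0.40]
model, final_threshold = create_row_filled_data(ustar, 0.35)
```

Without the explicit handler, this repair logic would have to be entangled with the normal flow, which is exactly the difficulty the PDG notation is designed to avoid.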
EXPERIMENTAL EVALUATION THROUGH THE SCIWALKER TOOL
We have developed a tool, called SciWalker, which uses the process and data
derivation representations (DFGs, DDGs, and PDGs) to support the definition and execution of
analytic webs. The instances in a DFG or PDG supported by SciWalker are all accompanied
by specific metadata annotations, viewed through clickable menu items that provide specific
information about how they were generated. The process metadata generated by SciWalker
are easily adapted for inclusion in XML applications such as EML (Table 2).
By using SciWalker to automate the process of estimating nighttime carbon exchange
from eddy covariance data, we have been able to quickly determine the effect of varying
u*threshold on estimated nighttime C flux from a forest. Simultaneously, we used SciWalker to
create the DDG of this process, which provides a comprehensive audit trail that is easily
accessible via the Internet. With this audit trail, data that were included or excluded can be easily
retrieved and examined. Others have examined effects of u*threshold on estimates of carbon flux
(Barford et al. 2001, Saleska et al. 2003, Hollinger et al. 2004), but neither the sets of accepted
and excluded data, nor the gap-filling models used in these studies, are readily accessible by
other researchers. Using an analytic web is a step forward both in ease and speed of data
processing and analysis for us as individual researchers, and also a great leap forward in
communicating our data analysis procedures to others.
While influences of u*threshold on ecosystem carbon flux estimated from eddy covariance
data have been examined in several papers, effects of other meteorological variables have been
examined less frequently. Wind direction is of particular interest, because forest composition is
rarely uniform around an eddy covariance tower. In general, one cannot relate carbon flux to a
specific type of forest without limiting the range of wind directions that provide acceptable data
for processing. As carbon exchange estimates and statistical models of carbon exchange for one
forest cannot be applied to other forests unless the forest composition is similar in the two areas,
it is often important for researchers to partition eddy covariance data by wind direction in order
to confine measurements to a specific forest type. The Analytic Web provides an approach to do
this, by allowing for the specification of the range of wind direction for included versus excluded
data. In the case of our estimates of C exchange by a hemlock forest, we included data only if the
winds were from the southwest (180–270° compass bearing) because of the relatively small size
of the hemlock forest we were studying, and the position of the eddy covariance tower in the
northeast corner of hemlock-dominated forest. Using SciWalker, we are now examining the
effects of using other ranges of wind direction, thereby including other forest types within the
eddy covariance footprint.
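Partitioning eddy covariance data by wind direction reduces to a parameterized sector test, as sketched below. The sector bounds of 180 to 270 degrees match the hemlock example; the field names and the wrap-around handling are our assumptions.

```python
# Sketch of a wind-direction selection criterion. Making the sector a
# parameter lets the same analytic web be re-run for other wind
# sectors, and hence other forest types within the tower footprint.

def in_sector(direction, lo=180.0, hi=270.0):
    # Handles sectors that wrap through north, e.g. (315, 45).
    direction %= 360.0
    if lo <= hi:
        return lo <= direction <= hi
    return direction >= lo or direction <= hi

observations = [{"wdir": 210.0}, {"wdir": 90.0}, {"wdir": 265.0}]
selected = [o for o in observations if in_sector(o["wdir"])]
rejected = [o for o in observations if not in_sector(o["wdir"])]
```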
DISCUSSION AND FUTURE DIRECTIONS
There is an important need to develop tools and techniques that will support repeatability,
reproducibility, and reliability of the scientific processes used to analyze ecological data. Our
proposal for the use of software engineering technology to create analytic webs as syntheses of
three types of graphs seems quite promising, based upon our initial work with the SciWalker
prototype and its application to eddy flux data used to estimate whether forests are sources or
sinks of CO2. In its current implementation, SciWalker has not yet integrated the PDGs with
the DFGs and DDGs. However, even without the full capabilities provided by PDGs, our
application of SciWalker already has led to new scientific insights and interesting new results.
Future versions of SciWalker will incorporate the PDGs and support assessment of
process reliability. It is our goal that scientific analyses eventually be accompanied by
certifications derived from the (presumably successful) application of formal process analyzers
to their process metadata. These certifications would then be usable by other scientists to guide them
away from dangerous misuse of datasets or combinations of processes. The net result will be
science that is not only more rapid and efficient, but also more safe and sound.
ACKNOWLEDGMENTS
This work was supported by NSF grant CCR-0205575 and is a contribution from the Harvard
Forest Long Term Ecological Research (LTER) program. We thank Elizabeth Farnsworth, Nick
Gotelli, Matt Jones, and Kristin Vanderbilt, for critical comments on drafts of the manuscript.
LITERATURE CITED
Ailamaki, A., Y. E. Ioannidis, and M. Livny. 1998. Scientific workflow management by database
management. Proceedings of the 10th international conference on scientific and statistical
database management. Capri, Italy.
Aloisio, G., and M. Cafaro. 2003. A dynamic earth observation system. Parallel Computing
29:1357-1362.
Altintas, I., C. Berkeley, E. Jaeger, M. Jones, B. Ludäscher, and S. Mock. 2004a. Kepler: an
extensible system for design and execution of scientific workflows. Proceedings of the
16th international conference on scientific and statistical database management. Santorini
Island, Greece.
Altintas, I., C. Berkeley, E. Jaeger, M. Jones, B. Ludäscher, and S. Mock. 2004b. Kepler:
towards a grid-enabled system for scientific workflows. Proceedings of the tenth global
grid forum - the workflow in grid systems workshop. Berlin, Germany.
Andelman, S. J., C. M. Bowles, M. R. Willig, and R. B. Waide. 2004. Understanding
environmental complexity through a distributed knowledge network. BioScience 54:240-
246.
Bailey R. E. 2002. Global warming and other eco-myths: how the environmental movement uses
false science to scare us to death. Prima Publishing, Roseville, California, USA.
Baldocchi, D. D., B. B. Hicks, and T. P. Meyers. 1988. Measuring biosphere-atmosphere
exchanges of biologically related gases with micrometeorological methods. Ecology
69:1331-1340.
Barford, C. C., S. C. Wofsy, M. L. Goulden, J. W. Munger, E. H. Pyle, S. P. Urbanski, L.
Hutyra, S. R. Saleska, D. Fitzjarrald, and K. Moore. 2001. Factors controlling long- and
short-term sequestration of atmospheric CO2 in a mid-latitude forest. Science 294:1688-
1691.
Belovsky, G. E., D. B. Botkin, T. A. Crowl, K. W. Cummins, J. F. Franklin, M. L. Hunter, Jr., A.
Joern, D. B. Lindenmayer, J. A. MacMahon, C. R. Margules, and J. M. Scott. 2004. Ten
suggestions to strengthen the science of ecology. BioScience 54:345-351.
Bettelheim, E. C., and G. D'Origny. 2002. Carbon sinks and emissions trading under the Kyoto
Protocol: a legal analysis. Philosophical Transactions of the Royal Society of London
A360:1827-1851.
Billard, L., and E. Diday. 2003. From the statistics of data to the statistics of knowledge:
symbolic data analysis. Journal of the American Statistical Association 98:470-487.
Bowers, S. and B. Ludäscher. 2004. An ontology-driven framework for data transformation in
scientific workflows, Pages 1-16 in E. Rahm, editor. Proceedings of the 1st international
workshop in data integration in the life sciences (DILS). Springer-Verlag, Berlin,
Germany.
Bradbury, J. P., and W. E. Dean. 1993. Elk Lake, Minnesota: evidence for rapid climate change
in the north central United States. GSA Special Paper 276:1-336.
Brown, S. 2002. Measuring, monitoring, and verification of carbon benefits for forest-based
projects. Philosophical Transactions of the Royal Society of London Series A360:1669-
1683.
Buja, A., D. Cook, and D. F. Swayne. 1996. Interactive high-dimensional data visualization.
Journal of Computational and Graphical Statistics 5:78-99.
Carswell, F. E., A. L. Costa, M. Palheta, Y. Malhi, P. Meir, J. D. R. Costa, M. D. Ruivo, L. D.
M. Leal, J. M. N. Costa, R. J. Clement, and J. Grace. 2002. Seasonality in CO2 and H2O
flux at an eastern Amazonian rain forest. Journal of Geophysical Research-Atmospheres
107:8076.
Cramer, W., A. Bondeau, S. Schaphoff, W. Lucht, B. Smith, and S. Sitch. 2004. Tropical forests
and the global carbon cycle: impacts of atmospheric carbon dioxide, climate change and
rate of deforestation. Philosophical Transactions of the Royal Society of London
B359:331-343.
Esser, G., H. F. H. Lieth, J. M. O. Scurlock, and R. J. Olson. 2000. Osnabrück net primary
productivity data set. Ecology 81:1177.
Foster I., and C. Kesselman. 1999. The Grid: blueprint for a new computing infrastructure.
Morgan-Kaufmann Publishers, San Francisco, CA.
Garcia-Berthou, E., and C. Alcaraz. 2004. Incongruence between test statistics and P values in
medical papers. BioMed Central Medical Research Methodology 4:13.
Ghezzi C., M. Jazayeri, and D. Mandrioli. 2003. Fundamentals of software engineering, 2nd
edition. Pearson Education, Inc., Upper Saddle River, New Jersey.
Gotelli N. J., and A. M. Ellison. 2004. A primer of ecological statistics. Sinauer Associates,
Sunderland, Massachusetts.
Goulden, M. L., J. W. Munger, S.-M. Fan, B. C. Daube, and S. C. Wofsy. 1996. Exchange of
carbon dioxide by a deciduous forest: response to interannual climate variability. Science
271:1576-1578.
Grace, J., J. Lloyd, J. McIntyre, A. C. Miranda, P. Meir, H. S. Miranda, C. Nobre, J. Moncrieff,
J. Massneder, Y. Malhi, I. Wright, and J. Gash. 1995. Carbon dioxide uptake by an
undisturbed tropical rainforest in western Amazonia, 1992 to 1993. Science 270:778-780.
Grace, J., Y. Malhi, J. Lloyd, J. McIntyre, A. C. Miranda, P. Meir, and H. S. Miranda. 1996. The
use of eddy covariance to infer the net carbon dioxide uptake of the Brazilian rainforest.
Global Change Biology 2:209-218.
Hadley, J. L., and J. L. Schedlbauer. 2002. Carbon exchange of an old-growth eastern hemlock
(Tsuga canadensis) forest in central New England. Tree Physiology 22:1079-1092.
Hall, B., G. Motzkin, and D. R. Foster. 2002. Three hundred years of forest and land-use change
in Massachusetts. Journal of Biogeography 29:1319-1336.
Helly, J. J., T. T. Elvins, D. Sutton, D. Martinez, S. E. Miller, S. Pickett, and A. M. Ellison.
2002. Controlled publication of digital scientific data. Communications of the
Association for Computing Machinery 45:97-101.
Hollinger, D. Y., J. D. Aber, B. Dail, E. A. Davidson, S. M. Goltz, H. Hughes, M. Y. Leclerc, J.
T. Lee, A. D. Richardson, C. Rodrigues, N. A. Scott, D. Achuatavarier, and J. Walsh.
2004. Spatial and temporal variability in forest-atmosphere CO2 exchange. Global
Change Biology in press.
Hollinger, D. Y., F. M. Kelliher, J. N. Byers, J. E. Hunt, T. M. McSeveny, and P. L. Weir. 1994.
Carbon dioxide exchange between an undisturbed old-growth temperate forest and the
atmosphere. Ecology 56:134-150.
Huet S., A. Bouvier, M.-A. Poursat, and E. Jolivet. 2004. Statistical tools for nonlinear
regression: a practical guide with S-Plus and R examples, 2nd edition. Springer-Verlag,
Inc., New York, New York.
Intergovernmental Panel on Climate Change (IPCC). 2001. Climate change 2001: synthesis
report. IPCC Secretariat, Geneva, Switzerland.
Jones, M. B., C. Berkley, J. Bojilova, and M. Schildhauer. 2001. Managing scientific metadata.
IEEE Internet Computing September/October 2001:59-68.
Kareiva, P. M. 2002. Applying ecological science to recovery planning. Ecological Applications
12:629.
Krishnan, A. 2004. A survey of life sciences applications on the Grid. New Generation
Computing 22:111-125.
Li, W. W., R. W. Byrnes, J. Hayes, A. Birnbaum, V. M. Reyes, A. Shahab, C. Mosley, D.
Pekurovsky, G. B. Quinn, I. N. Shindyalov, H. Casanova, L. Ang, F. Berman, P. W.
Arzberger, M. A. Miller, and P. E. Bourne. 2004. The encyclopedia of life project: Grid
software and deployment. New Generation Computing 22:127-136.
Lomborg B. 2001. The skeptical environmentalist: measuring the real state of the world.
Cambridge University Press, Cambridge, UK.
Lubchenco, J., A. M. Olson, L. B. Brubaker, S. R. Carpenter, M. M. Holland, S. P. Hubbell, S.
A. Levin, J. A. MacMahon, P. A. Matson, J. M. Melillo, H. A. Mooney, C. H. Peterson,
H. R. Pulliam, L. A. Real, P. J. Regal, and P. G. Risser. 1991. The sustainable biosphere
initiative: an ecological research agenda. Ecology 72:371-412.
Malhi, Y., A. D. Nobre, J. Grace, B. Krujit, M. G. P. Pereira, A. Culf, and S. Scott. 1998. Carbon
dioxide transfer over a central Amazonian rain forest. Journal of Geophysical Research
D24:31593-31612.
Meehl, G. A., W. M. Washington, J. M. Arblaster, and A. X. Hu. 2004. Factors affecting climate
sensitivity in global coupled models. Journal of Climate 17:1584-1596.
Michener, W. K. 2000. Ecological knowledge and future data challenges. Pages 162-174 in W.
K. Michener, and J. W. Brunt, editors. Ecological data: design, management and
processing. Blackwell Science, Oxford, UK.
Michener, W. K., T. J. Baerwald, P. Firth, M. A. Palmer, J. L. Rosenberger, E. A. Sandlin, and
H. Zimmerman. 2001. Defining and unraveling biocomplexity. BioScience 51:1018-
1023.
Michener, W. K., J. W. Brunt, J. J. Helly, T. B. Kirchner, and S. G. Stafford. 1997.
Nongeospatial metadata for the ecological sciences. Ecological Applications 7:330-342.
Regan, H. M., M. Colyvan, and M. A. Burgman. 2002. A taxonomy and treatment of uncertainty
for ecology and conservation biology. Ecological Applications 12:618-628.
Reinefeld, A., and F. Schintke. 2002. Concepts and technologies for a worldwide grid
infrastructure. Proceedings of Euro-Par 2002 Parallel Processing 2400:62-71.
Saleska, S. R., S. D. Miller, D. M. Matross, M. L. Goulden, S. C. Wofsy, H. R. de Rocha, P. B.
de Camargo, P. Crill, B. C. Daube, H. C. de Freitas, L. Hutyra, M. Keller, V. Kirchnoff,
M. Menton, J. W. Munger, E. Hammond-Pyle, A. H. Rice, and H. Silva. 2003. Carbon in
Amazon forests: unexpected seasonal fluxes and disturbance-induced losses. Science
302:1554-1557.
Saunders, L. S., R. Hanbury-Tenison, and I. R. Swingland. 2002. Social capital from carbon
property: creating equity for indigenous people. Philosophical Transactions of the Royal
Society of London A360:1763-1775.
Savage, K. E., and E. A. Davidson. 2001. Interannual variation of soil respiration in two New
England forests. Global Biogeochemical Cycles 15:337-350.
Schemske, D. W., B. C. Husband, M. H. Ruckelshaus, C. Goodwillie, I. M. Parker, and J. G.
Bishop. 1994. Evaluating approaches to the conservation of rare and endangered plants.
Ecology 75:584-606.
Smith, A. J., J. J. Donovan, E. Ito, and D. R. Engstrom. 1997. Ground-water processes
controlling a prairie lake's response to middle Holocene drought. Geology 25:391-394.
Wegman, E. J. 1995. Huge data sets and the frontiers of computational feasibility. Journal of
Computational and Graphical Statistics 4:281-285.
Wigley, T. M. L., and S. C. B. Raper. 1992. Implications for climate and sea level of revised
IPCC emissions scenarios. Nature 357:293-300.
Wise A. 1998. Little-JIL 1.0 language report. Computer Science Technical Report, University of
Massachusetts, Amherst, Massachusetts.
Wofsy, S. C., M. L. Goulden, J. W. Munger, S. M. Fan, P. S. Bakwin, B. C. Daube, S. L.
Bassow, and F. A. Bazzaz. 1993. Net CO2 exchange in a mid-latitude forest. Science
260:1314-1317.
Table 1. A taxonomy of data set sizes (after Wegman 1995).
Descriptor Data set size (bytes) Storage
Tiny 10² A piece of paper
Small 10⁴ A notebook
Medium 10⁶ Removable media (e.g., floppy disk)
Large 10⁸ A single hard disk
Huge 10¹⁰ Multiple hard disks or RAID arrays
Ridiculous 10¹² Robotic tape storage silos
Table 2. Simplified excerpt from an XML file describing the Analytic Web shown in Fig. 3. The
data types for Selected Data and Interpolation Model (boxes in Fig. 3) are declared through
references to associated EML files. The activity Create Interpolation Model (oval) is declared by
including the actual script in the R statistical language. The connections (edges) in the Dataflow
Graph are specified by the <connector> and <binding> elements.
<analyticWeb>
  <type-declaration name="cflux-data">
    <attributes>year,doy,u,ustar,wdir,par,tair,dmintair,tsoil,vpd,mco2flux,eco2flux,source</attributes>
    <body>
      <metadata object="cflux-data.xml" language="EML" version="2.0.0"/>
    </body>
  </type-declaration>
  <type-declaration name="nighttime-model">
    <attributes>mco2flux,tsoil</attributes>
    <body>
      <metadata object="nighttime-model.xml" language="EML" version="2.0.0"/>
    </body>
  </type-declaration>
  <activity-declaration name="create-model">
    <input name="data" type="cflux-data"/>
    <output name="model" type="nighttime-model"/>
    <body>
      <script language="R" version="1.9.0">model=lm(log(mco2flux)~tsoil, data)</script>
    </body>
  </activity-declaration>
  <dataflow-declaration name="nighttime-cflux">
    <data name="selected-data" type="cflux-data"/>
    <data name="interpolation-model" type="nighttime-model"/>
    <activity name="create-interpolation-model" type="create-model"/>
    <connector kind="fork">
      <binding from="selected-data" to="create-interpolation-model"/>
      <binding from="selected-data" to="merge-data"/>
    </connector>
    <connector kind="simple">
      <binding from="create-interpolation-model" to="interpolation-model"/>
    </connector>
  </dataflow-declaration>
</analyticWeb>
Figure 1. Estimated nighttime CO2 flux (solid circles and solid line), and fraction of data used in
the analysis (open circles and dashed line) from April – June 2001 above a central Massachusetts
hemlock forest, for southwesterly winds and for different ranges of friction velocity (u*) (Hadley
and Schedlbauer 2002).
Figure 2. Examples of a data flow graph (DFG), data derivation graph (DDG) and process
definition graph (PDG). In a data flow graph (A) datasets are indicated by rectangles and
processes are indicated by ovals. Edges (arrows) indicate the direction of flow. In this DFG,
datasets of Type 1 and Type 2 are used by Tool A to create a dataset of Type 3. A data derivation
graph (B) illustrates the particular outcome or instance of executing the DFG. Instances of
datasets are indicated by rectangles with one corner clipped off, and instances of applying a
specific tool are indicated by ovals with all four corners clipped off. Edges (arrows) indicate how
a given dataset was derived from a previous dataset. These edges are annotated (dotted lines) to
indicate which process was applied. In this DDG, a particular dataset of Type 3 was derived from
two particular datasets of Type 1 and Type 2 respectively, by applying a specific kind of Tool A.
A process definition graph (C) is a symbolic representation of the procedural details required by
the DFG to produce the DDG. Each step in the procedure is indicated by a set of icons
surrounding a rectangular Step Bar. Icons outside the Step Bar specify pre-condition (to the left
of the Step Bar) or post-condition (to the right of the Step Bar) steps that control and monitor
execution of the step. Icons within the Step Bar specify the ordering of the process (sequential,
parallel, or non-deterministic) and the behavior to follow if exceptions occur. The circles are
handles to which specifications of the resources needed to support execution of the step can be
attached. Asterisks indicate that zero or more instances of the step may require execution,
depending on the number of items available for processing.
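The DFG/DDG distinction in this figure can be made concrete with a short sketch. The following Python fragment is illustrative only: the class names, dataset file names, and run labels are assumptions for this example, not part of the authors' tools.

```python
# Minimal sketch of the DFG/DDG distinction in Figure 2.
# A DFG records dataset *types* and the tools that connect them;
# a DDG records the concrete *instances* produced by one execution.

class DataFlowGraph:
    """Template: dataset types and the tools that transform them."""
    def __init__(self):
        self.edges = []  # (input_types, tool_name, output_type)

    def add_process(self, input_types, tool_name, output_type):
        self.edges.append((tuple(input_types), tool_name, output_type))

class DataDerivationGraph:
    """Record of one execution: which instances produced which."""
    def __init__(self):
        self.derivations = []  # (input_ids, tool_run, output_id)

    def record(self, input_ids, tool_run, output_id):
        self.derivations.append((tuple(input_ids), tool_run, output_id))

# The DFG of Fig. 2A: Type 1 + Type 2 --Tool A--> Type 3
dfg = DataFlowGraph()
dfg.add_process(["Type 1", "Type 2"], "Tool A", "Type 3")

# One execution of that DFG yields the DDG of Fig. 2B
ddg = DataDerivationGraph()
ddg.record(["dataset_1.csv", "dataset_2.csv"], "Tool A (run 1)", "dataset_3.csv")

print(len(ddg.derivations))  # → 1, a single recorded derivation
```

Executing the same DFG again with different inputs would append a second derivation record, which is how a DDG accumulates an audit trail across runs.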
Figure 3. A data flow graph (DFG) illustrating the process used for the analysis of the eddy
covariance data and for tracking the effects on the output of changes in the u* threshold (one of
the Selection Criteria). As in Fig. 2A, datasets are indicated by rectangles and processes by
ovals. Arrows (edges) indicate the direction of flow. The diamond indicates that the Selected
Data are used in both the Create Interpolation Model process and the Merge Data process.
Creating this DFG is the first step in generating process metadata for the analysis of the eddy
covariance data.
Figure 4. A data flow graph (DFG) that incorporates an iterative loop into the process used for
the analysis of the eddy covariance data and for tracking the effects on the output of changes in
the u* threshold (one of the Selection Criteria). As in Fig. 2A and Fig. 3, datasets are indicated
by rectangles and analytical processes by ovals. Arrows (edges) indicate the direction of flow.
Diamonds indicate points where datasets are used multiple times or where decisions about
processes must be made.
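The iterative loop in this figure amounts to rerunning the selection step for each candidate u* threshold and recording the outcome of each pass. The sketch below is a hypothetical illustration under assumed threshold values and toy data; the real analysis operates on the eddy covariance datasets, not on these placeholder rows.

```python
# Illustrative sketch of the iterative loop of Fig. 4: apply the
# Selection Criteria for several u* thresholds and keep one record
# per iteration. Rows and thresholds here are placeholders.

def apply_selection_criteria(rows, u_star_threshold):
    """Keep rows whose friction velocity meets the threshold."""
    return [r for r in rows if r["u_star"] >= u_star_threshold]

rows = [
    {"u_star": 0.1, "co2_flux": 1.0},
    {"u_star": 0.3, "co2_flux": 2.0},
    {"u_star": 0.5, "co2_flux": 3.0},
]

runs = []  # one entry per loop iteration
for threshold in (0.2, 0.4):
    selected = apply_selection_criteria(rows, threshold)
    runs.append({"u_star_threshold": threshold, "n_selected": len(selected)})

print(runs)
# → [{'u_star_threshold': 0.2, 'n_selected': 2},
#    {'u_star_threshold': 0.4, 'n_selected': 1}]
```

Recording each iteration this way is what allows the effect of the threshold choice on downstream estimates to be traced rather than reconstructed after the fact.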
Figure 5. A data derivation graph (DDG) specifying the instance (or result) of a single
application of the DFG shown in Fig. 4. The specific datasets used (e.g., Selected Data 1) are
indicated by clipped boxes, and the specific processes used (e.g., application of the first
Interpolation Model) are indicated by clipped ovals. Arrows (edges) indicate from which
previous datasets a given dataset was derived. The dotted lines (annotations) indicate which
process was applied to which sets of data.
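The audit-trail value of a DDG such as this one is that any derived dataset can be traced back through its annotated edges to everything it depends on. A minimal sketch of that backward walk, using dataset and process names modeled loosely on this figure (the exact records are assumptions for illustration):

```python
# Hypothetical sketch: given DDG derivation records, recover every
# dataset a given output was (transitively) derived from.

def ancestors(derivations, target):
    """Return the set of datasets from which `target` was derived."""
    found = set()
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for inputs, _process, output in derivations:
            if output == current:
                for dataset in inputs:
                    if dataset not in found:
                        found.add(dataset)
                        frontier.append(dataset)
    return found

# DDG fragment modeled on Fig. 5 (names illustrative)
ddg = [
    (("Raw Data",), "Apply Selection Criteria", "Selected Data 1"),
    (("Selected Data 1",), "Create Interpolation Model", "Interpolation Model 1"),
    (("Selected Data 1", "Interpolation Model 1"), "Merge Data", "Filled Data 1"),
]

print(sorted(ancestors(ddg, "Filled Data 1")))
# → ['Interpolation Model 1', 'Raw Data', 'Selected Data 1']
```

The same traversal run forward (matching on inputs rather than outputs) identifies every derived product that would need regeneration if a raw dataset were revised.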
Figure 6. A stacked view of a DFG and DDG that illustrates different instances using stacked
icons. The DFG is at the bottom of the stack (grey boxes and ovals), and the instances (DDGs)
are placed on top of the DFG (white boxes and ovals). This example combines the DFG shown
in Fig. 4 with the DDG shown in Fig. 5.
Figure 7. A process definition graph (PDG) using Little-JIL notation for the analytic web
describing the analysis of eddy covariance data. Each step in the process is illustrated with a Step
Bar (rectangle) surrounded by a set of icons. Open triangles to the right and left of each Step
Bar indicate that there are no pre- or post-condition requirements for that step, whereas the
closed triangle to the right of the Create Interpolation Model Step Bar indicates a pre-condition
requirement (here, that the sample size is sufficiently large to create the model). Icons
within Step Bars specify the ordering of the process. The arrows in Estimate Total Nighttime
Carbon Flux, Create Row-Filled Data, Apply Selection Criteria, and Evaluate and Revise
specify sequential (row-by-row) processing. The X in Create Row-Filled Data specifies an
exception handler, and the connected icon (the circle with return-arrow) symbolizes the process
to be carried out to handle the exception. The asterisk above Evaluate and Revise indicates that
zero or more instances of this step may require execution. The circles are handles to which
specifications of the resources needed to support execution of the step can be attached.
Execution proceeds from bottom right to top left.