managing data from multiple disciplines, scales, and sites...

9
ELSEV IER Managing Data from Multiple Disciplines, Scales, and Sites To Support Synthesis and Modeling R. J. Olson,* J. M. Briggs, f J. H. Porter, G. R. Mah, ¶ and S. G. Stafford The synthesis and modeling of ecological processes at multiple spatial and temporal scales involves bringing to- gether and sharing data from numerous sources. This ar- ticle describes a data and information system model that facilitates assembling, managing, and sharing diverse data from multiple disciplines, scales, and sites to support integrated ecological studies. Cross-site scientific-domain working groups coordinate the development of data asso- ciated with their particular scientific working group, in- cluding decisions about data requirements, data to be compiled, data formats, derived data products, and schedules across the sites. The Web-based data and infor- mation system consists of nodes for each working group plus a central node that provides data access, project in- formation, data query, and other functionality. The ap- proach incorporates scientists and computer experts in the working groups and provides incentives for individu- als to submit documented data to the data and informa- tion system. Published by Elsevier Science Inc. INTRODUCTION ° Oak Ridge National Laboratory, Oak Ridge, Tennessee I Kansas State University, Manhattan 1 University of Virginia, Charlottesville U.S. Geologic Survey, Sioux Falls, South Dakota § Colorado State University, Fort Collins Address correspondence to Richard J. Olson, Oak Ridge National Laboratory, P.O. Box 2008, MS 6407, Oak Ridge, TN 37831-6407. E-mail: [email protected] Received 15 June 1998; revised 16 October 1998. REMOTE SENS. ENVIRON. 70:99-107 (1999) Published by Elsevier Science Inc. 655 Avenue of the Americas, New York, NY 10010 bring together investigators, ideas, and data for the pur- pose of locally validating the National Aeronautics and Space Administration Earth Observation System-era global data sets (Running et al., 1999, this issue). This coordinated exercise proposes to study the effect of scal- ing from a fine grain to a coarse grain on estimates of important biosphere variables from a range of biome conditions. Using standardized methods that incorporate extensive ground data sets, ecosystem models, and re- motely sensed imagery, sites will develop local maps of land cover class (LCC), leaf area index (LAI), and above- ground net primary productivity (NPP). This article de- scribes a data and information system (DIS) model that would facilitate assembling, managing, and sharing di- verse data from multiple disciplines, scales, and sites to support integrated ecological studies, such as the Big- Foot project. Integrated, Multisite Projects Projects that entail the synthesis and modeling of ecolog- ical phenomena over regions require multidiscipline data from multiple study areas or sites. Such complex projects often require an integrated database that is comprised of historical data, newly collected field data, remotely sensed image data, and GIS (geographic information sys- tem) coverages. The challenge of combining diverse data for global change research has been studied by a Na- tional Research Council review committee (NRC, 1995). On the basis of six case studies, the NRC committee identified 18 barriers to combining diverse data and de- veloped 10 keys to success. Only two of the ten keys dealt with technology while the others addressed cultural aspects of management, human behavior, and marketing associated with developing integrated data resources. A companion challenge of ensuring the availability of eco- 0034-4257/99/8—see front matter PII 50034-4257(99)00060-7 The synthesis and modeling of ecological processes at multiple spatial and temporal scales involves bringing to- gether and sharing data from numerous sources. An ex- ample of such an activity is the BigFoot project that will

Upload: others

Post on 29-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

ELSEV IER

Managing Data from Multiple Disciplines,Scales, and Sites To Support Synthesisand Modeling

R. J. Olson,* J. M. Briggs, f J. H. Porter, G. R. Mah, ¶and S. G. Stafford

The synthesis and modeling of ecological processes atmultiple spatial and temporal scales involves bringing to-gether and sharing data from numerous sources. This ar-ticle describes a data and information system model thatfacilitates assembling, managing, and sharing diversedata from multiple disciplines, scales, and sites to supportintegrated ecological studies. Cross-site scientific-domainworking groups coordinate the development of data asso-ciated with their particular scientific working group, in-cluding decisions about data requirements, data to becompiled, data formats, derived data products, andschedules across the sites. The Web-based data and infor-mation system consists of nodes for each working groupplus a central node that provides data access, project in-formation, data query, and other functionality. The ap-proach incorporates scientists and computer experts inthe working groups and provides incentives for individu-als to submit documented data to the data and informa-tion system. Published by Elsevier Science Inc.

INTRODUCTION

° Oak Ridge National Laboratory, Oak Ridge, TennesseeI Kansas State University, Manhattan1 University of Virginia, Charlottesville

U.S. Geologic Survey, Sioux Falls, South Dakota§ Colorado State University, Fort CollinsAddress correspondence to Richard J. Olson, Oak Ridge National

Laboratory, P.O. Box 2008, MS 6407, Oak Ridge, TN 37831-6407.E-mail: [email protected]

Received 15 June 1998; revised 16 October 1998.

REMOTE SENS. ENVIRON. 70:99-107 (1999)Published by Elsevier Science Inc.655 Avenue of the Americas, New York, NY 10010

bring together investigators, ideas, and data for the pur-pose of locally validating the National Aeronautics andSpace Administration Earth Observation System-eraglobal data sets (Running et al., 1999, this issue). Thiscoordinated exercise proposes to study the effect of scal-ing from a fine grain to a coarse grain on estimates ofimportant biosphere variables from a range of biomeconditions. Using standardized methods that incorporateextensive ground data sets, ecosystem models, and re-motely sensed imagery, sites will develop local maps ofland cover class (LCC), leaf area index (LAI), and above-ground net primary productivity (NPP). This article de-scribes a data and information system (DIS) model thatwould facilitate assembling, managing, and sharing di-verse data from multiple disciplines, scales, and sites tosupport integrated ecological studies, such as the Big-Foot project.

Integrated, Multisite ProjectsProjects that entail the synthesis and modeling of ecolog-ical phenomena over regions require multidiscipline datafrom multiple study areas or sites. Such complex projectsoften require an integrated database that is comprised ofhistorical data, newly collected field data, remotelysensed image data, and GIS (geographic information sys-tem) coverages. The challenge of combining diverse datafor global change research has been studied by a Na-tional Research Council review committee (NRC, 1995).On the basis of six case studies, the NRC committeeidentified 18 barriers to combining diverse data and de-veloped 10 keys to success. Only two of the ten keysdealt with technology while the others addressed culturalaspects of management, human behavior, and marketingassociated with developing integrated data resources. Acompanion challenge of ensuring the availability of eco-

0034-4257/99/8—see front matterPII 50034-4257(99)00060-7

The synthesis and modeling of ecological processes atmultiple spatial and temporal scales involves bringing to-gether and sharing data from numerous sources. An ex-ample of such an activity is the BigFoot project that will

Page 2: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

100 Olson et al.

logical data has been studied by the Ecological Societyof America (Gross et al., 1995). This study identified agrowing concern that long-term ecological data are lostafter projects are completed because investigators moveon to other interests and the data are rarely archived.Both data management concerns of combining diversedata and providing ready access and long-term archive todata are addressed as an integral part of the proposeddata and information system.

One of the projects examined by NRC as an exam-ple of a large integrated ecological study is the FirstISLSCP (International Satellite Land Surface Climatol-ogy Project) Field Experiment (FIFE) funded by NASA(NRC, 1995). The FIFE project collected intensive re-mote sensing and field data on a prairie site in Kansasin 1987 and 1989 to better understand fluxes betweenthe land surface and the atmosphere and to develop as-sociated remote-sensing methodologies. The project in-cluded 29 teams organized in six disciplinary groups. Anextensive collection of FIFE data and metadata werecompiled and distributed as a 5-volume CD-ROM set(Strebel et al., 1994a), and these data are currently avail-able online from the Oak Ridge National Laboratory(ORNL) Distributed Active Archive Center (DAAC)World Wide Web home page (http://vvww-eosdis.orn1.-gov). The DAAC is the long-term archive for the data.The FIFE information system was also used to demon-strate the implementation of a conceptual framework forscientific information systems (Strebel et al., 1994b).

LTER Network Information System

Another example of developing an integrated database isthe efforts of the Long-Term Ecological Research(LTER) Network to facilitate intersite research by link-ing individual site information systems. In the LTERNetwork, individual sites routinely collect daily climatedata and maintain the data in local information systemsusing a variety of formats. Climate data are almost uni-versally collected in multisite studies and are typically themost commonly requested data to perform synthesis andmodeling. To facilitate such synthesis, the LTER com-munity gathered individual site temperature and precipi-tation data and created online monthly summaries foreach site (Henshaw et al., 1998). While these summariessatisfied a need for access to monthly site climate data,there was no provisions for maintaining and updatingthese summaries or satisfying frequent requests for dailydata; therefore, a strategy was developed to provide cli-matic data dynamically (Henshaw et al., 1998). Each siteprovides access to standardized daily climate files via anInternet address that points to the location of static filesor dynamic scripts. Daily climate data are harvested au-tomatically by a central site into a centralized databaseand applications programs produce two monthly distribu-tion reports or formats from the daily climate database.

The challenge of developing and implementing this strat-egy is reflected by the fact that the process has evolvedover 10 years, starting with the establishment of stan-dards for baseline meteorological measurements in 1986(Greenland, 1986).

In addition, the LTER information managers haveled a concerted effort to develop a LTER Network In-formation System (NIS) (1995 LTER Data ManagersWorkshop, http://lternet.edu/im/dm95report.htm) to sup-port basic ecological research, both at the site and net-work levels. When fully implemented, the system willimprove Internet connectivity among the research sitesand transform a collection of individual sites and data-bases into a coherent network. It will be a distributedsystem so that site data and information can be main-tained locally. The scaleable system will enable users toexpand site-specific studies into intersite studies and touse prototype systems or modules, that are developed forsmall groups or studies, for larger groups and intersiteuse. Queries and browsing of data sets and metadata lo-cated at multiple sites will be essential characteristics forthe system to function as an intersite research tool. Thesystem could have the capacity for data input at the sitelevel to accommodate the needs generated by specificresearch groups. Currently NIS includes an investigatorlist, bibliography (Chinn and Bledsoe, 1997), data set di-rectory, climate database, and a prototype compilation ofsite characteristics, with information and data from allsites in each. Future plans are to expand this system toencompass all types of ecological data from the LTERsites.

An important issue in developing the LTER Net-work Information System is the consistent use of existingcontent standards for data and metadata at all sites, aswell as establishment of new content standards wherethey do not exist. Standards will be a prerequisite for thedistributed system's capability to present data and infor-mation in a consistent fashion, independent of their orig-inal format and location. Standardization also is helpfulfor supporting meaningful synthesis of data across multi-ple sites. However, network-wide format standards,which may simplify implementation of the system, areless critical than the content standards, since a uniformappearance could be achieved by using filter programsto create standard views from files in different formatsas maintained by sites. Nevertheless, data and metadataformat decisions will still affect a variety of tasks, suchas development of system parts such as a distributedcross-site catalog. This will require consistency betweensites, standardized keywords, development of the catalogquery system (minimum metadata standards, structure/format/access methods standard within site, writing fil-ters to produce desired displays, etc.). Platform indepen-dence will make it possible to use the system in the het-erogeneous hardware and software environments in useat different sites.

Page 3: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

Integration and Management of Ecological Data 101

Data and Information System Model GoalsThe data and information system model for multisiteprojects incorporates concepts tested in the FIFE proj-ect and in the evolving LTER Network Information Sys-tem design. The goals are to develop a multitiered infor-mation system that supports the project, facilitatesmaximum use of the data by the projects' investigators,creates a data archive, and provides data access to otherresearchers. The data and information system model in-corporates data processing as an integral component ofongoing activities associated with the project, especiallyamong science working groups and at individual sites ofa multisite project.

Key features associated with the model include:

data will be available to both internal project us-ers and outside users in a timely manner;appropriate credit will be given to the data origi-nators;data will undergo quality assurance and enhance-ments into formats easily usable by scientists;metadata will be provided that are essential tothe understanding and use of the data;data from dispersed sources will be compiledinto standardized data sets or data products;the overall data requirements, formats, andschedules will be determined by science domainswith oversight by the project leader;the data management component will requireproject resources shared between science do-mains and the project data and informationsystem.

OVERVIEW

An information system model for multisite projects wasdeveloped by the authors at workshop in May 1996 asso-ciated with planning of a predecessor of the BigFootproject. The core component of the data and informationsystem was perceived as a set of cross-site scientific-do-main working groups (e.g., NPP, LAI, remote sensing,etc.). Each of the working groups would coordinate thedevelopment of data associated with their particular sci-entific working group, including decisions about data re-quirements, data to be compiled, data formats, deriveddata products, and schedules across the sites. This ar-rangement can be envisioned as a set of pyramids withthe sites and their data at the bottom of each pyramid,data processing in the middle, and the final standardizeddata set(s) at the peak. The workshop participantsstressed the need for providing incentives for individualsto prepare documented data sets. This model contraststo traditional information system models for ecologicalresearch projects where information management maybe relegated to a separate, isolated data group and thelong-term maintenance of data may not be addressed. It

is similar to the model discussed by Strebel at al.(1994b), with additional emphasis placed on responsibili-ties of scientific working groups and the long-term datadistribution and archive center.

The overall data and information system consists ofnodes for each scientific working group plus a centralnode (Fig. 1). The central node is the primary entranceto all of the data. It would include access control, projectinformation, data query, data archive, and other func-tionality. The system is envisioned to be Web-based andaccessible through one or more of the popular browsersusing an html type interface to the data and information.

Scientific-domain working groups might include in-dividuals collecting field measurements, those compilingremote sensing and spatial data, those developing mod-els, and those aggregating and extrapolating spatial infor-mation. The group leader for each scientific-domainworking group provides scientific and technical leader-ship. The group leader would play a critical role in thedevelopment of data sets and data products for eachworking group. They would be in a position to identifythe critical data resource needs and quality assurance cri-teria because of their scientific expertise. Equally impor-tant, as a scientific leader in the project, they would havethe prestige needed to encourage other participants tofully collaborate in the development of data sets, meta-data, and data products.

Group leaders would insure that the multisite datafor their node were either entered into the database oraccessible in standardized formats within the overallproject schedule. However, the leaders may not have thetechnical expertise in information management to fullyimplement the data compilation efforts. For this reason,one or more technically oriented partners for each leaderwill be required. The resulting partnership assures scien-tific credibility, computer efficiency, and timeliness withthe leader providing guidance and clout and the partnerproviding expertise in database tools and networking.The project-level data activities would be performed bya project data staff comprised of the technical partnersfor the groups and a leader for the project data and in-formation system. In addition, the project leader pro-vides overall direction and coordination among thegroups and sites.

The data and information system provides access tothe complete, combined, consistent data at each node(some nodes may be located physically together). Theremay be links to other data archives that would provideaccess to project or related data. Access may be limitedto data originators during the active phases of data com-pilation and analysis. However, as data sets become moremature, they become publicly accessible through theproject data and information system. Finally, they aremoved to a long-term archive and distribution center andare advertised through master data directories, such asthe Global Change Master Directory (GCMD).

Page 4: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

2. Scientific Working Groups

3. Project Data &InformationSystem

DataDistribution& Archive Center

MasterData Directory

102 Olson et al,

PI-2 T1-3/P1-n1. Investigators PI-2 P1-3

PI-1 IP 1

-n

Figure 1. Information system model for man-aging and sharing data from multiple sites anddisciplines to support ecological synthesis andmodeling.

GENERIC ISSUESIncentives for Compiling Data SetsA key to compiling well-documented data sets and creat-ing an integrated database is to encourage the coopera-tion of those who produce the data by providing incen-tives and rewards for database development (Olson et al.,1996). Some of the incentives that may be considered indeveloping the data and information system are summa-rized in Table 1. The process can also be limited by timeand funds, authorship issues, insufficient global perspec-tive (e.g., investigators may not see the need for re-cording latitude and longitude for a single site), a lack ofstandards and guidelines along with associated training,or the use of standards that do not fit complex ecologicaldata and field situations.

Data Release PolicySince October 1990, the LTER network has had guide-lines (see Porter and Callahan, 1994) to promote makingresearch data available to other researchers. To imple-ment these guidelines, each LTER site has developed itsown data management policy in consultation with key in-

vestigators and higher administrative units. In 1998 theguidelines were incorporated into a network-wide infor-mation management policy (Table 2). Table 2 providesguidelines and rationale, but each site is expected to de-fend its own implementation through the site and peer-review process. The LTER network has developed addi-tional recommendations to alleviate concerns about thepossible unethical use of data sets resulting from thispolicy; see http://lter.edu for more details.

CreditsTo encourage individuals to compile data, mechanismsmust be developed to give credit to the individuals asso-ciated with data compilation, especially in multicompo-nent projects. Credit will be conveyed by citations thatindicate the individuals responsible for the data in thedata set. The preferred citation, in a format similar tojournal references, can be included in the rnetadata. Be-cause data sets are sometimes combined for synthesis orbroader modeling applications, a policy must be devel-oped for citing data from multiple contributors. Thescheme would be analogous to citing author(s) of chap-

Page 5: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

Integration and Management of Ecological Data 103

Table I. Incentives To Encourage the Development of Multiplesite Datasets

Goal IncentiveIncrease cooperation between scientists and data managersPromote data flow from scientists to data managers

Increased completeness and timeliness of data products

Increase usefulness of the data

Increase usefulness of the dataEncourage scientists to compile data

Protect scientist's exclusive right for initial publication of dataanalysis

Involve data managers in project planning and scienceProduce rewarding data products with deadline as joint effort

between scientists and data managers, e.g., CD-ROM of dataAllocate sufficient resources for data activities by investigators and

data managersRequire documentation as an essential and useful component of

the projectAdopt metadata standards (e.g., Michener et al., 1997)Develop citation policy to give credit to data compilers, especially

for datasets produced from multiple contributorsDevelop data release policy (e.g., see LTER data policy)

ters in a book, editor(s) of a book, and editor(s) of a se-ries of related volumes. The following scheme is pro-posed for use by a multidiscipline project. If a article orreport uses data from a specific data set, then the dataoriginator(s) will be cited similar to a book chapter. If aarticle used data processed by a science team at the the-matic or site level, then the leader of the science-work-ing group would be cited similar to a book editor. Fi-nally, if the entire project database was used, then anacronym can be used to represent the database and usedto reflect contributions of all participants similar to abook series editor.

Data EnhancementEven when well-documented data sets are available, theyoften must be augmented, or enhanced, before they canbe used in new applications. Enhancing the data, as usedin the context of this article, refers to data processingthat is often required to perform integrated analyses andmodeling within a multicomponent project. An enhanceddata set is a data product derived from an initial data setand which is made more complete and internally consis-tent. The enhancement process addresses the practicalaspects of incomplete data sets associated with fieldwork

(e.g., problems associated with broken instruments, mis-calibrated sensors, observer errors, and unusual fieldevents). Missing values or outliers may be replaced withestimated values with appropriate flags and documenta-tion of the estimation process. Calibration factors may beadjusted on the basis of an analysis of the project data.Data may be aggregated or extrapolated to create datasets with appropriate spatial and temporal characteristics.Associated site data—such as data related to soils, topog-raphy, and climate—may be acquired from other existingsources as needed by the models and become part of theenhanced data set.

Data DistributionThe Internet provides a powerful mechanism for facili-tating the flow of information to a wide variety of users.However, this technology does not automatically solvethe concerns about documenting the data integratingdisperse data, or creating a long-term data archive. Al-though the technology for creating individual Web homepages is widely available, the functionality of the sitesvaries, depending on the background and resources ofthe site administrators. The basic functions of a Web siteand associated data center that distributes data should

Table 2. Data Access Policy for the LTER Network (November 1998)

There are two types of data: Type I (data that ae freely available within 2-3years) with minimum restrictions, Type II [exceptional data sets that are avalable onlywith written permission from the Pl/investigator(s)]. Implied in this timetable isthe assumption that some data sets require more effort to get online andthat no "blanket policy" is going to cover all data sets at all sites. However, eachsite would pursue getting all of their data online in the most expedientfashion possible

The number of data sets that are assigned TYPE II status should be rare inoccurrence and that the justification for exceptions must be well documented andapproved by the lead PI and data site manager. Some examples of Type II datamay include: locations of rare or endangered species, data that are coveredby copyright laws (e.g., TM and/or SPOT satellite data) or some types of censusdata involving human subjects

Page 6: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

104 Olson et al.

include distributing data on multiple media, allowingdata browsing even by users with limited Internet capa-bilities, maintaining and distributing metadata, offering adata-search-arid-order capability, providing data security,and providing a data archive. During the active phase ofthe project, all of the individuals, teams, and central in-formation system may have active Web sites with linksto each other. Web pages for individuals and teams mayhave limited functionality and accessibility while the cen-tral Web page has full functionality.

While the Internet is especially valuable for data ex-change during the active phase of a project, CD-ROMsprovide a valuable way to organize and distribute the col-lection of project data at the completion of the project.In fact, creating the CD-ROM package of data sets andmetadata can provide motivation for investigators tocomplete individual data sets, especially the metadata, tobe included on the CD-ROM. The CD-ROM is a staticrepresentation of the data, data products, and imagery.As a consequence, provisions for correcting errors, up-dating data sets, and adding new data and data productsshould be anticipated, such as maintaining lists of all in-dividuals receiving CD-ROMs.

SYSTEM DESCRIPTION

Compiling multidisciplinary or multisite data to supporta synthesis project will be accomplished by a data matu-ration process (Strebel et al., 1994b) in which data flowfrom , investigators that collect and analyze field measure-ments through working groups that collate and standard-ize data to the project level for synthesis studies and, fi-nally, into a long-term data distribution arid archivecenter that provides data to secondary users. In addition,descriptions of the data set may be submitted to a masterdata directory with search functionality. Although thefunctions can overlap or several functions can be per-formed at one level, the general roles and responsibilitiesof individuals at each of the five levels are describedbelow.

InvestigatorTraditionally investigators collect and analyze their owndata with little need for standardized data sets or writtenmetadata. In a synthesis project, investigators assume re-sponsibilities for collecting, inputting, performing qualitycontrol (QC) and quality assurance (QA), processing, anddocumenting their data following working group (nextsubsection) and project (subsection after next) guidelinesand schedules for managing their data. Data preparationto meet project synthesis needs can be costly, often be-yond the investigators' interests and resources; therefore,it is essential not only to provide guidance and resourcesbut also to ensure that individuals receive credit and that

incentives are given, such as co-authorship of articlesthat use the data from many sites. After review and au-thorization by the investigator, data would be made avail-able to the appropriate project working group.

Working GroupThe scientific-domain working groups have needs tostandardize and collate data collected at the multiplesites to allow for combined synthesis and modeling. Of-ten these activities are tied to running models and pub-lishing articles by groups of scientists—this incentive iskey to developing complete and consistent data sets forthe project. Working groups would coordinate the devel-opment of data associated with their respective domains,including issues of content, consistency, and complete-ness. At the beginning of the project working group lead-ers, in partnership with their technical support and guid-ance from the project DIS (next subsection), wouldestablish data requirements, data content standards, datacollection methods, processing algorithms, data formats,derived data products, and data processing schedules.

Working groups would be responsible for per-forming QA checks and metadata reviews, collating theclean and documented data sets from the investigators,and possibly processing the collated data with standardalgorithms. The QA checks include plots, statistics, andsorted lists; a review of logical combinations of variables;standardization of dates, times, coordinates, units, codes;and consistent data organization with consistent namingconventions. Comparing and combining data from multi-ple sites can often facilitate QA checks that otherwise arenot normally performed on data from individual sites. Ifneeded, the working group would develop ways for fill-ing in missing or problem data. Developing the data setsand data products for the scientific-domain workinggroup may be done locally (at the same location, usinglocal equipment or central resources) or remotely (withthe working group leader at different locations). Theworking groups may not have the interest or resourcesfor maintaining and distributing the data sets after theirsynthesis studies are completed. Therefore, it is impor-tant for the working groups to anticipate the need tomake their documented data available to a central nodefor synthesis arid modeling within the project and foreventual distribution to users outside of the project.

Project DISThe project data and information system (DIS) makesdata from the scientific-domain working groups availablefor project-level synthesis and modeling. Data activitiesat the project level include providing coordination be-tween the working groups so that the data from theworking groups will fit together at the project level andimplementing the overall system. The project leader andthe DIS working group would oversee information re-

Page 7: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

Integration and Management of Ecological Data 105

Table 3 Typical Primary and Secondary Data Processing Roles Performed by a Project Data and Information System andTypes of Data and Information Processed at the Project Level

Primary Roles Secondary Roles Project Information

Establish data and metadata standards

Prepare documented data sets

Review metadata, data QA, consistency,and completeness .

Provide data processing support

Develop project data and informationsystem (Web-based)

Manage the project data: backup, accesscontrol, etc.

Distribute project information and dataas requested

Gather historical and backgrounddata

Distribute copies of large off-linedata sets

Process meteorological and otherauxiliary site data

Process satellite/aircraft remotesensing

Provide data analysis tools andsupport

Prepare datasets for public access

Transfer data to a long-term dataarchive center

Experimental data and metadata

Inventory of data needs and available data

Geographic information system based data

Remote sensing satellite data

Flight logs, videos, site photos, aerial photos, andmaps

Project data policies, guidelines, and standards

Contact list of individuals associated with project

Annotated bibliography of key papers

sources and data products from the individual workinggroups for overall project needs. To expedite the flow ofdata during the later stages of the project, it is vital toinvolve the DIS group in all aspects of the project fromthe beginning.

The DIS group develops and maintains a centralWeb home page to distribute information and data to in-vestigators within the project. It has primary responsibil-ity for the central node and linking the other nodes to-gether. The DIS staff may provide data processingsupport for project investigators, distribute project infor-mation, and work closely with investigators to prepare ac-curate and completely documented data sets. The DISstaff may include individuals associated with specificworking groups, that is, technical partners with dualmembership in the DIS. The primary and secondaryroles of the project DIS are described in Table 3. Table3 also lists types of project data and information that maybe included in the DIS.

In addition to data and data products, the DIS groupmay write programs for loading data, applications soft-ware, and models. Some of these programs may be keptinternally as tools for managing data or maintaining pro-grams used to generate data sets. If these programs areappropriate to release with the data sets as companionfiles or separate files, then metadata is prepared andsoftware submitted to the distribution and archivecenter.

The project DIS generally does not support data dis-tribution to a broader set of users or commit to long-termarchiving of the data after completion of the project. Asdata sets are completed or at the end of the project, cleandata sets and metadata are submitted to a distribution andarchive center for distribution to the general public.

Distribution and Archive CenterThe long-term distribution and archive center maintainsand distributes data and provides support to data users.

The center works with the project members to identifyhow best to support the project and to establish a sched-ule for transferring data. Data sets are transferred to thecenter as the project finishes or as individual data setsare completed. Data sets are reviewed for consistencyand completeness and entered into the center's informa-tion management system. The center must acquire finan-cial and institutional support to provide for long-term ar-chive of the data. The LTER Network InformationSystem is evolving toward providing the distribution andarchive functions for LTER data. The Oak Ridge Na-tional Laboratory DAAC for Biogeochemical Dynamicsprovides these functions for NASA and selected non-NASA field measurement data.

The distribution and archive center reviews meta-data to ensure that they are accurate and complete andthat they provide present and future users (those not di-rectly involved or familiar with the project) adequate in-formation to understand the nature of the data sets andto assess their appropriateness for purposes beyond thescope of the original project. The data are reviewed toensure that they agree with the metadata and are consis-tent and complete. The center also extracts keywordsand other metadata for use as search criteria that enableusers with varying scientific backgrounds to locate andacquire data that meet their needs. Finally, the centeracquires approval from data providers for final data prod-ucts prior to public release and loads the data and meta-data into the archive. Other data center functions in-clude: providing data security with password protectionand backups of data files; providing links to investigators,project science teams, and project data and informationsystem Web pages; archiving data (multiple copies, long-term retention, migration to new media as required); no-tifying users of any errors, changes, or updates to thedata; advertising and actively marketing available data re-sources among scientists and other users; maintainingstandard and valid keywords; supporting a data search-

Page 8: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

1 06 Olson et al.

and-order capability; providing user support; and sup-porting data needs of decision makers, educators, andthe general public.

The center continues to maintain, distribute, and ar-chive the data after the project finishes. The center sup-ports the field project by providing a link between dataproviders and users. That is, the center generally may beable to answer many of the user questions, but if neces-sary the center could contact investigators for more in-formation. In addition, the center may routinely generatestatistics on users of the data, solicit information fromthe investigators on publications and new data products,solicit information from the investigators and user com-munity about potential errors and updates, provide amechanism for investigators to exchange ideas on pro-posed activities or improvements, incorporate new dataand data updates, and communicate changes and updatesto users.

Master Data DirectoryMaster data directories, or data catalogs, contain largecollections of descriptions of data sets but do not workdirectly with or archive the data. The data directory con-tains pointers to the numerous locations that contain thedata and therefore provides a mechanism for scientists tolocate, acquire, and integrate data from various archives.Generally, directory-level metadata are very broad intheir coverage so that they assist a scientist in learningwhat data are available for a particular area. The meta-data allow a user to determine whether a data set is in-teresting enough that additional specific informationshould be sought. Directory-level metadata can usuallybe used to reject a particular data set as not appropriate;however, to fully determine whether the data set locatedin the directory is of real interest to the scientist, moreinformation about that data set is required at anotherlevel of the metadata hierarchy. To acquire the actualdata set described in the master data directory, the usermust access the center or individual that maintains thedata set.

The NASA-funded Global Change Master Directory(GCMD) is an example of a master data directory. Along-term goal of this effort is to provide science userswith the ability to find and view information about sci-ence data regardless of what data system actually holdsthe metadata. The directory metadata interoperabilityhas been addressed through two mechanisms: the devel-opment of a Directory Interchange Format (DIF) andthe development of a DIF-based master directory thatcontains data across agencies, disciplines, and interna-tional boundaries. The DIF contains about 35 descriptivefields with controlled lists of keywords for many of thefields. The GCMD, including a description of the DIFstructure, can be accessed at http://gcmd.gsfc.nasa.gov/.

SUMMARY

The data and information system model described in thisarticle facilitates assembling, managing, and sharing di-verse data from multiple disciplines, scales, and sites tosupport integrated ecological studies. It allows bringingtogether and sharing data from numerous sources for thesynthesis and modeling of ecological processes at multi-ple spatial and temporal scales. The core component ofthe system is a set of cross-site scientific-domain workinggroups. Each of the working groups would coordinatethe development of data associated with their particularscientific working group, including decisions about datarequirements, data to be compiled, data formats, deriveddata products, and schedules across the sites. Therefore,the proposed approach incorporates scientists in theworking groups and provides incentives for individuals tosubmit documented data. The overall data and informa-tion system consists of nodes for each working groupplus a central node. The central node is the primary en-trance to all of the data and includes access control, proj-ect information, data query, data archive, and other func-tionality. The system is envisioned to be Web-based andaccessible through one or more of the popular browsersusing an html-type interface to the data and information.

Ideas and experiences from many individuals are incorporatedin this article. The authors especially acknowledge the continu-ing dialog on developing data and information systems withinthe LTER network of data managers, the DAAC system devel-opment, and data centers at ORNL. Comments from threeanonymous reviewers are also greatly appreciated. This re-search was sponsored by the National Aeronautics and SpaceAdministration. ORNL Environmental Sciences Division Pub.No. 4900. ORNL is managed by Lockheed Martin Energy Re-search Corporation for U.S. Department of Energy under con-tract DE-ACO5-960R22464. The U.S. government retains anon-exclusive royalty free license to publish or reproduce thepublished form of this contribution, or allow others to do so,for U.S. Government purposes.

REFERENCES

Chinn, H., and Bledsoe, C. (1997), Internet access to ecologicalinformation: the U.S. LTER All-Site Bibliography Project.BioScience 47:50-57.

Greenland, D. (1986), Standardized meteorological measure-ments for long-term ecological research sites. Bull. Ecol.Soc. Am. 67:275-277.

Gross, K. L., Pake, C. E., and the FLED Committee Members(1995), Future of Long-Term Ecological Data (FLED), Vol.1: Text of the Report, Ecological Society of America, Wash-ington, DC.

Ilenshaw, D. L., Stubbs, M., Benson, B. J., Baker, K., Blodgett,D., and Porter, J. H. (1998), Climate database project: astrategy for improving information access across researchsites. In Data and Information Management in the Ecologi-cal Sciences: A Resource Guide, (W. K. Michner, J. P. Por-

Page 9: Managing Data from Multiple Disciplines, Scales, and Sites ...andrewsforest.oregonstate.edu/pubs/pdf/pub2607.pdfverse data from multiple disciplines, scales, and sites to support integrated

Integration and Management of Ecological Data 107

ter, and S. G. Stafford, Eds.), LTER Network Office, Uni-versity of New Mexico, Albuquerque, NM, pp. 123-127.

Michener, W. K., Brunt, J. W., Belly, J., Kirchner, T. B., andStafford, S. G. (1997), Non-geospatial metadata for the eco-logical sciences. Ecol. Appl. 7:330-342.

National Research Council (NRC) (1995), Finding the Forestin the Trees: The Challenge of Combining Diverse Environ-mental Data, National Academy Press, Washington, DC.

Olson, R. J., Voorhees, L. D., Field, J. M., and Gentry, M J.(1996), Packaging and distributing ecological data frommultisite studies. In Proceedings of the Eco-Inforrna '96,Global Networks for Environmental Information, Environ-mental Research Institute of Michigan, Ann Arbor, pp.93-102.

Porter, J. H., and Callahan, J. T. (1994), Circumventing a di-lemma: historical approaches to data sharing in ecologicalresearch. In Environmental Information Management and

Analysis: Ecosystem to Global Scales (W. K. Michener, J.W. Brunt, and S. G. Stafford, Eds.), Taylor and Francis,Bristol, PA, pp. 193-203.

Running, S. W., Baldocchi, D., Turner, D., Gower, S. T., Bak-win, P., and Hibbard, K. (1999), A global terrestrial moni-toring network integrating tower fluxes, flask sampling, eco-system modeling and EOS satellite data. Remote Sens.Environ. 70:108-127.

Strebel, D. E., Landis, D. R., Huemmrich, K. F., and Meeson,B. W. (1994a), Collected Data of the First ISLSCP Field Ex-periment, Vols. 1-5, published on CD-ROM, National Aero-nautics and Space Administration, Greenbelt, MD.

Strebel, D. E., Meeson, B. W., and Nelson, A. K. (1994b), Sci-entific information systems: a conceptual framework. In En-vironmental Information Management and Analysis: Ecosys-tem to Global Scales (W. K. Michener, J. W. Brunt, and S.G. Stafford, Eds.), Taylor and Francis, Bristol, PA, pp.59-85.