Scientific Data Collection and Management

Lecture 6

Data Collection

The previous lecture mentioned two projects where informatics enhanced scientific data practices.

Generally, informatics technology can enhance data collection by

•automating experimental and observational equipment,

•organizing and storing large volumes of observations,

•filtering observations for those of particular interest, and

•providing interfaces to ease manual recording tasks.

Advances in all of these areas greatly affect the reliability, accessibility, and utility of scientific data.

Data Collection: Autonomous Sciencecraft Experiment

The Jet Propulsion Laboratory at Caltech outfitted the EO-1 satellite with software tools for

• recognizing interesting events, such as volcanic eruptions and flooding;

• planning human-requested observations and replanning to accommodate novel events; and

• evaluating mission plans to ensure that the appropriate resources exist.

JPL claims that the onboard autonomy saves over $1M in annual operating costs.

Similarly, Deep Space 1 included tools for planning, execution, and monitoring.

Data Collection: LHCb

The Large Hadron Collider beauty experiment is an attempt to study why matter is favored over antimatter.

The experiment will

• produce 10M events per second (proton collisions),

• filter these to 1M events at the detector level,

• filter these to 2K events using software that runs on 1K 16-core computers, and

• generate 250GB of information every hour.

The computational filtering tools lead to a more manageable set of informative events.
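The filtering stages can be turned into rough reduction factors. The sketch below is only back-of-the-envelope arithmetic under stated assumptions: that every rate on the slide is per second, that 250GB means 250 × 10^9 bytes, and that the hourly volume consists entirely of the events kept by the software filter. The variable names are illustrative.

```python
# Event rates quoted on the slide, assumed to be events per second.
COLLISION_RATE = 10_000_000   # produced by proton collisions
DETECTOR_RATE = 1_000_000     # kept after detector-level filtering
SOFTWARE_RATE = 2_000         # kept after the software filter

# Assumption: 250GB is 250 * 10**9 bytes, written over one hour.
BYTES_PER_HOUR = 250 * 10**9
SECONDS_PER_HOUR = 3600


def reduction_factor(rate_in: int, rate_out: int) -> float:
    """Return how many incoming events there are per event kept."""
    return rate_in / rate_out


detector_reduction = reduction_factor(COLLISION_RATE, DETECTOR_RATE)
software_reduction = reduction_factor(DETECTOR_RATE, SOFTWARE_RATE)
overall_reduction = reduction_factor(COLLISION_RATE, SOFTWARE_RATE)

# Assumption: all stored bytes belong to events that passed the software filter.
events_per_hour = SOFTWARE_RATE * SECONDS_PER_HOUR
bytes_per_event = BYTES_PER_HOUR / events_per_hour

print(f"detector stage keeps 1 in {detector_reduction:.0f} events")
print(f"software stage keeps 1 in {software_reduction:.0f} events")
print(f"overall, 1 in {overall_reduction:.0f} events is kept")
print(f"roughly {bytes_per_event / 1000:.1f} kB stored per kept event")
```

Under these assumptions most of the reduction happens in software, which is why the slide emphasizes the computational filtering tools.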

Data Collection: CyberTracker

CyberTracker software enables the collection of field data by

• providing intuitive interfaces to mobile computers for recording observations;

• offering interface development tools that require little to no programming skill; and

• associating observations with location-specific information for later analysis.

Originally, CyberTracker let non-literate, expert trackers contribute to ecosystem monitoring.

This and similar informatics solutions have assisted across the sciences.

Data Management

Computer storage fundamentally altered the way that scientists could store, retrieve, and share their data.

Common storage formats include

•structured text, such as field-delimited or XML files;

•spreadsheet files, as used by Excel and OpenOffice; and

•databases, such as MySQL and Oracle.

While the formats are generally interconvertible, they differ in the capabilities they readily support.

For example, databases substantially ease and accelerate item selection and retrieval when compared to structured text.
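The difference in selection can be sketched by running the same query against delimited text and against a database. This is a minimal illustration, not a benchmark: the records, the `specimens` table, and both function names are assumptions, and it uses SQLite from the Python standard library rather than the MySQL or Oracle systems named above.

```python
import sqlite3

# Illustrative records; field names and values are assumptions.
TEXT_DATA = """id,weight,height
143,14.2,9.3
121,17.2,6.5
157,15.8,7.1
"""


def select_from_text(text: str, min_weight: float) -> list[int]:
    """Scan every line of the delimited text and keep matching IDs."""
    ids = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        record_id, weight, _height = line.split(",")
        if float(weight) > min_weight:
            ids.append(int(record_id))
    return sorted(ids)


def select_from_database(conn: sqlite3.Connection, min_weight: float) -> list[int]:
    """Let the database engine do the selection, using an index on weight."""
    rows = conn.execute(
        "SELECT id FROM specimens WHERE weight > ? ORDER BY id", (min_weight,)
    )
    return [row[0] for row in rows]


# Load the same records into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (id INTEGER PRIMARY KEY, weight REAL, height REAL)")
conn.execute("CREATE INDEX idx_weight ON specimens (weight)")
for line in TEXT_DATA.strip().splitlines()[1:]:
    record_id, weight, height = line.split(",")
    conn.execute(
        "INSERT INTO specimens VALUES (?, ?, ?)",
        (int(record_id), float(weight), float(height)),
    )

print(select_from_text(TEXT_DATA, 15.0))
print(select_from_database(conn, 15.0))
```

Both calls return the same IDs. The text version must parse every record on every query, while the database version states the condition declaratively and can use the index.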

Structured Text: Field Delimited Text

Field delimited text may be the most commonly used storage format apart from free text.

Conventions exist for field, record, and file indicators, and the first line often serves as a field header.

ID,weight,height,…
143,14.2,9.3,…
121,17.2,6.5,…

This representation is highly portable in that it does not require special software to interpret its contents.
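As a minimal sketch of the header convention, the standard `csv` module can treat the first line as field names. Only the three visible fields of the example are used here; the fields elided by "…" are left out rather than guessed, and the function name `read_records` is illustrative.

```python
import csv
import io

# The three visible fields from the example; elided fields are omitted.
SAMPLE = "ID,weight,height\n143,14.2,9.3\n121,17.2,6.5\n"


def read_records(text: str) -> list[dict]:
    """Parse delimited text, using the first line as the field header."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        # Every value arrives as a string, so types are converted by hand.
        records.append(
            {
                "ID": int(row["ID"]),
                "weight": float(row["weight"]),
                "height": float(row["height"]),
            }
        )
    return records


records = read_records(SAMPLE)
for record in records:
    print(record)
```

Note that nothing in the file says which fields are numeric or which units apply; that knowledge lives in the reading code or in separate documentation.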

Structured Text: Field Delimited Text

Field delimited text is highly portable in that

• any text editor can read and write the files,

• all programming languages support text manipulation, and

• sharing involves simple file transfer operations.

However, there are several disadvantages that suggest the need for more sophisticated approaches:

• field descriptions (e.g., measurement units) and format conventions are stored in separate, free text files;

• links among data sets are implicit or nonexistent; and

• there is no mechanism for propagating additions and corrections to data consumers.

Structured Text: XML

XML is a standard for specifying markup languages.

Data stored using the Extensible Markup Language are text-based, structured, and largely self documenting.

<data>
  <record>
    <id>143</id>
    <weight unit="pounds">14.2</weight>
    <height unit="inches">9.3</height>
  </record>
</data>

The tag and attribute names are domain specific and add an arbitrary level of description to text.
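A short sketch of how a program might read such a file, using Python's standard `xml.etree.ElementTree` module. The function name `read_measurements` is illustrative; the point is that the unit travels with each value instead of living in a separate description file.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<data>
  <record>
    <id>143</id>
    <weight unit="pounds">14.2</weight>
    <height unit="inches">9.3</height>
  </record>
</data>
"""


def read_measurements(xml_text: str) -> list[dict]:
    """Return one dict per record, pairing each value with its unit."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.findall("record"):
        entry = {"id": int(record.findtext("id"))}
        for field_name in ("weight", "height"):
            element = record.find(field_name)
            # The unit is read from the attribute stored next to the value.
            entry[field_name] = (float(element.text), element.get("unit"))
        records.append(entry)
    return records


print(read_measurements(XML_TEXT))
```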

Structured Text: XML

XML retains many of the advantages of delimited text and solves the problem of detached data descriptions.

The structure of an XML file can be validated against a domain specific schema.

Data sets represented in XML can be linked through structure that is explicitly shared via schemas.

Disadvantages include

•the verbosity of XML files,

•the tediousness of reading and writing raw XML, and

•the problems of propagating alterations in the data.

Spreadsheets

Spreadsheet software is widely used in the sciences for data management, analysis, and visualization.

Data management involves storing an array of values in cells, where the first row often contains field names.

Such applications support arbitrarily explicit structure.

Also, they can store many data sets in one file.

However, this functionality requires spreadsheet software to read and edit the data.

Spreadsheets

As with delimited text, spreadsheet files are

•organized into fixed rows and columns,

•structured by convention or whim, and

•awkward in collaborative environments.

As data storage, spreadsheets improve upon text files in that the associated applications

• enable direct data access through the user interface;

• enable the dynamic update of values via formulas; and

• ease the creation of data subsets through pivot tables.

Databases

Databases provide a flexible platform for data storage, manipulation, and retrieval.

They differ from other approaches in that they

• enable explicit links among data sets;

• improve data sharing through centralized access;

• support multiple structural views of the data; and

• allow concurrent updates and accesses.

Databases also ease data integration from multiple measuring instruments (e.g., sensor networks).

However, databases generally require domain-specific interfaces for effective use by researchers.

Databases

Although databases improve over file-centric approaches, they are relatively rare in science because

• they take substantial effort to design;

• they require training in a query language or development of application-specific interfaces;

• many scientists work with small data sets that are easily managed in spreadsheets; and

• the value of data is often ephemeral, in that data sets are infrequently accessed after analysis and publication.

Nevertheless, databases are advantageous when data sharing extends beyond a small group of collaborators.

Relational Database Schema

The database organization is expressed as a schema, which may take several forms.

Relational schemas define

• relations, each of which has a heading and a body;

• headings, which have field names and field types;

• bodies, which contain field values; and

• keys, which identify records in the database.

Relations can be thought of as tables that contain partial information for each record.

Each table may come from a separate data source, and they all may be joined to construct a complete data set.
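The pieces of a relational schema can be sketched with SQLite from the Python standard library. The two tables, their fields, and the sample rows are assumptions made for illustration: each table holds partial information about a specimen, and the shared key lets a join reconstruct the complete record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Heading: field names and types. Key: id identifies each record.
    CREATE TABLE body_measurements (
        id     INTEGER PRIMARY KEY,
        weight REAL,
        height REAL
    );

    -- A second relation, imagined as coming from a separate data source.
    CREATE TABLE field_sites (
        specimen_id INTEGER REFERENCES body_measurements(id),
        site        TEXT
    );

    -- Bodies: the field values themselves (hypothetical).
    INSERT INTO body_measurements VALUES (143, 14.2, 9.3), (121, 17.2, 6.5);
    INSERT INTO field_sites VALUES (143, 'north ridge'), (121, 'river delta');
    """
)

# Join the partial tables on the key to build the complete data set.
complete = conn.execute(
    """
    SELECT m.id, m.weight, m.height, s.site
    FROM body_measurements AS m
    JOIN field_sites AS s ON s.specimen_id = m.id
    ORDER BY m.id
    """
).fetchall()

for row in complete:
    print(row)
```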

Object-Oriented Database Schema

Object-oriented databases store information as objects with properties that may refer to other objects.

For example, properties of a planet such as Mars include

• mass, diameter, and density, stored in the object as floating point numbers; and

• celestial coordinates stored as a reference to a time series object that holds locations and other information.

Objects may also include methods that interpret their properties (e.g., providing an average of a time series).

Knowledge of the various schemas is important so that one can develop informatics systems that hide them from users.
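The planet example can be sketched with plain Python classes. This only illustrates the object model, not an object-oriented database product: the class and field names are assumptions, the physical properties are approximate, and the coordinate samples are placeholders rather than real observations.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class TimeSeries:
    """Holds timestamped values; a method interprets the stored data."""

    times: list[float] = field(default_factory=list)
    values: list[float] = field(default_factory=list)

    def add(self, time: float, value: float) -> None:
        self.times.append(time)
        self.values.append(value)

    def average(self) -> float:
        """Return the mean of the recorded values."""
        return mean(self.values)


@dataclass
class Planet:
    name: str
    mass: float              # kilograms
    diameter: float          # kilometers
    density: float           # grams per cubic centimeter
    coordinates: TimeSeries  # a reference to another object, not a copy


# Placeholder coordinate samples (not real observations).
track = TimeSeries()
for day, right_ascension in [(0.0, 10.0), (1.0, 12.0), (2.0, 14.0)]:
    track.add(day, right_ascension)

# Approximate physical properties of Mars.
mars = Planet("Mars", mass=6.42e23, diameter=6779.0, density=3.93, coordinates=track)

print(mars.coordinates.average())
```

Because `coordinates` is a reference, several objects could share one time series, which is the kind of link a relational schema would express through keys.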

Spatial and Temporal Databases

Spatially and temporally distributed data require special treatment to enable and accelerate task specific retrieval.

Spatial databases support

• data types such as points, lines, and regions;

• topological queries about adjacency and enclosure; and

• directional queries such as north-of and above.

Temporal databases support

• valid time, which records when an observation is true;

• temporal joins that properly handle the valid-time tables; and

• temporal query terms such as timeslice and intersection.
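A timeslice over valid-time data can be emulated in an ordinary relational database. The sketch below uses SQLite from the Python standard library, which is not a temporal database; the `station_status` table, its half-open validity intervals, and the `timeslice` function are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Each row is true from valid_from (inclusive) to valid_to (exclusive).
    CREATE TABLE station_status (
        station    TEXT,
        status     TEXT,
        valid_from TEXT,
        valid_to   TEXT
    );
    INSERT INTO station_status VALUES
        ('A', 'calibrating', '2020-01-01', '2020-02-01'),
        ('A', 'recording',   '2020-02-01', '2020-06-01'),
        ('B', 'recording',   '2020-01-15', '2020-03-01');
    """
)


def timeslice(conn: sqlite3.Connection, instant: str) -> list[tuple]:
    """Return the rows whose valid-time interval contains the instant."""
    rows = conn.execute(
        """
        SELECT station, status
        FROM station_status
        WHERE valid_from <= ? AND ? < valid_to
        ORDER BY station
        """,
        (instant, instant),
    )
    return rows.fetchall()


print(timeslice(conn, "2020-01-10"))
print(timeslice(conn, "2020-02-10"))
```

ISO-formatted dates compare correctly as strings, which is what makes the plain comparison work here. A temporal database would offer the timeslice as a built-in query term instead of requiring each query to spell out the interval test.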

Data Directory: Earth Science

The Global Change Master Directory organizes a wide variety of data into a topic hierarchy.

http://gcmd.nasa.gov

The hierarchy grounds out in metadata with a link to the actual data provider, which may require extra steps.

The data tend to be in heterogeneously structured field-delimited files, which makes integration painfully tedious.

Data Repository: ICPSR

The Inter-university Consortium for Political and Social Research provides access to a wide range of data on

• economic behavior and attitudes,

• education,

• health care and facilities,

• legal systems,

• organizational behavior,

• social institutions and behavior, and more.

http://www.icpsr.umich.edu/

Data formats favor SPSS and SAS, which are commonly used in the social sciences for statistical analysis.

Metadata include bibliographic and methodological information stored as free text.

Data Repository: Protein Data Bank

The Protein Data Bank stores information on the structure of proteins, DNA, and other biological molecules.

The interface supports search and browsing based on the Gene Ontology and other established classification systems.

Structures are annotated with structured metadata and may be visualized in 3D directly from the web site.

The repository provides structural data in the PDB’s text-based format, established in the 1970s, or in a newer XML format.

Several informatics tools and secondary data repositories have arisen that take advantage of the PDB’s resources.

http://www.rcsb.org/pdb

Exploring the Protein Data Bank

Data Collection and Management: Summary

Informatics technology for data collection generally

Informatics technology for data management

The informatics tools that support these capabilities are often specific to a particular scientific domain.

As a result, the quality of the solutions may vary, and researchers risk reinventing the wheel.

• assists with the creation of structured knowledge; and

• integrates (identically structured) data from distributed knowledge sources.

• represents data in structured formats;

• provides interfaces to assist in data storage and retrieval;

• enables access to large repositories of data;