uc berkeley ischool nov, 2009

Post on 18-Dec-2014

“Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.” http://en.wikipedia.org/wiki/Cavendish_experiment

Hinges and Loops? -- Data as Evidence

I-School, UC Berkeley, November 13, 2009

The Tragedy of Othello: The Moor of Venice (Act 3 Scene 3)

Othello: “Villain, be sure thou prove my love a whore; Be sure of it; give me the ocular proof; Or by the worth of man’s eternal soul, Thou hadst been better have been born a dog Than answer my waked wrath!”

Iago: “Is’t come to this?”

Othello: “Make me to see’t; or at the least so prove it, That the probation bear no hinge nor loop To hang a doubt on; or woe upon thy life!”

“So the universe has always appeared to the natural mind as a kind of enigma, of which the key must be sought in the shape of some illuminating or power-bringing word or name. That word names the universe's principle, and to possess it is, after a fashion, to possess the universe itself. 'God,' 'Matter,' 'Reason,' 'the Absolute,' 'Energy,' are so many solving names. You can rest when you have them. You are at the end of your metaphysical quest.”

William James. "What Pragmatism Means". Lecture 2 in Pragmatism: A new name for some old ways of thinking. New York: Longmans, Green and Co. (1922): 52-52.

Internet Archive: http://www.archive.org/stream/pragmatismnewnam00jame

Note date of publication: 1922

Clear definitions are good (!)

We should not reflexively rely on metaphysical “solving” / “power-bringing” words…

ADD to James’s list?: “Knowledge” “Information” “Data” ???

“Data” ?

Usage Data: The word data is the Latin plural of datum, neuter past participle of dare, "to give", hence "something given".

“ Data leads a life of its own quite independent of datum, of which it was originally the plural. It occurs in two constructions: as a plural noun (like earnings), taking a plural verb and plural modifiers (as these, many, a few) but not cardinal numbers, and serving as a referent for plural pronouns; and as an abstract mass noun (like information), taking a singular verb and singular modifiers (as this, much, little), and being referred to by a singular pronoun. Both constructions are standard. The plural construction is more common in print, perhaps because the house style of some publishers mandates it.”

The Merriam-Webster Online Dictionary http://www.merriam-webster.com/dictionary/data

“Data” ? [technological]

“…’data’ are defined as any information that can be stored in digital form and accessed electronically, including, but not limited to, numeric data, text, publications, sensor streams, video, audio, algorithms, software, models and simulations, images, etc.” -- Program Solicitation 07-601 “Sustainable Digital Data Preservation and Access Network Partners (DataNet)”

Taken in this broadest possible sense, “data” are thus simply electronically coded forms of information. And virtually anything can be represented as “data” so long as it is electronically machine-readable.

“The digital universe in 2007 -- at 2.25 x 10^21 bits (281 exabytes or 281 billion gigabytes) -- was 10% bigger than we thought. The resizing comes as a result of faster growth in cameras, digital TV shipments, and better understanding of information replication.

“By 2011, the digital universe will be 10 times the size it was in 2006.

“As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.

“Fast-growing corners of the digital universe include those related to digital TV, surveillance cameras, Internet access in emerging countries, sensor-based applications, datacenters supporting “cloud computing,” and social networks.”

The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
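The figures quoted above can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes decimal units (1 exabyte = 10^18 bytes) and assumes that "10 times the size it was in 2006" by 2011 can be read as steady compound growth over five years, which the IDC summary does not itself state. All names are hypothetical.

```python
# Cross-check of the IDC figures quoted above (decimal units assumed).
BITS_PER_BYTE = 8
GIGABYTE = 10**9   # bytes
EXABYTE = 10**18   # bytes


def bits_to_bytes(bits):
    """Convert a bit count to bytes."""
    return bits / BITS_PER_BYTE


def implied_annual_growth(factor, years):
    """Constant yearly growth rate that yields `factor` after `years` years."""
    return factor ** (1 / years) - 1


universe_2007_bits = 2.25e21
exabytes_2007 = bits_to_bytes(universe_2007_bits) / EXABYTE
gigabytes_2007 = bits_to_bytes(universe_2007_bits) / GIGABYTE

# "10 times the size it was in 2006" by 2011, read as compound growth
growth_rate = implied_annual_growth(10, 2011 - 2006)

print(f"{exabytes_2007:.2f} exabytes")
print(f"{gigabytes_2007:.3e} gigabytes")
print(f"{growth_rate:.1%} implied growth per year")
```

Under these assumptions the conversion agrees with the quoted "281 exabytes or 281 billion gigabytes", and a tenfold increase over five years corresponds to growth of a little under 60% per year.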

“The diversity of the digital universe can be seen in the variability of file sizes, from 6 gigabyte movies on DVD to 128-bit signals from RFID tags. Because of the growth of VoIP, sensors, and RFID, the number of electronic information “containers” -- files, images, packets, tag contents -- is growing 50% faster than the number of gigabytes. The information created in 2011 will be contained in more than 20 quadrillion -- 20 million billion -- of such containers, a tremendous management challenge for both businesses and consumers.”

The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf

“Data” [epistemic]

“Measurements, observations or descriptions of a referent -- such as an individual, an event, a specimen in a collection or an excavated/surveyed object -- created or collected through human interpretation (whether directly “by hand” or through the use of technologies)”

-- AnthroDPA Working Group on Metadata (May, 2009)

“The General Definition of Information (GDI)”

σ is an instance of information, understood as semantic content, if and only if:

• (GDI.1) σ consists of one or more data;
• (GDI.2) the data in σ are well-formed;
• (GDI.3) the well-formed data in σ are meaningful.

Luciano Floridi <luciano.floridi@philosophy.ox.ac.uk> “Semantic Conceptions of Information” (First published Wed Oct 5, 2005) Stanford Encyclopedia of Philosophy http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]
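The three GDI clauses form a simple conjunction, which can be sketched as a predicate. This is a toy reading, not Floridi's formalism: the Morse-style code table, the helper names, and the decision to model "well-formed" as "uses only permitted symbols" and "meaningful" as "appears in the code table" are all assumptions made for illustration.

```python
# Toy reading of GDI.1-GDI.3 as a conjunction of three checks.
# The code table and both helper predicates are illustrative assumptions.
MORSE_TABLE = {".-": "A", "-...": "B"}


def well_formed_morse(data):
    """GDI.2 (toy version): every datum is a dot or a dash."""
    return all(symbol in (".", "-") for symbol in data)


def meaningful_morse(data):
    """GDI.3 (toy version): the sequence stands for something in the table."""
    return "".join(data) in MORSE_TABLE


def is_information(data, well_formed, meaningful):
    """True only if all three GDI clauses hold for `data`."""
    has_data = len(data) >= 1                      # GDI.1
    return has_data and well_formed(data) and meaningful(data)


print(is_information(list(".-"), well_formed_morse, meaningful_morse))
print(is_information([], well_formed_morse, meaningful_morse))
print(is_information(list(".x"), well_formed_morse, meaningful_morse))
```

The ordering of the checks mirrors the definition: the meaning test is only reached for data that exist and are well-formed.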

“…with the corollary assumptions that they are objective -- that is, not conditioned by subjective perspectives -- and invariant -- that is, true under all circumstances.”

-- Draft GBIF DPFTG Report, 2009

SEE: R. Nozick, Invariances: The Structure of the Objective World, Harvard University Press, Cambridge, 2001. AND L. Daston and P. Galison, Objectivity, Zone Books, NY, 2007.

The Diaphoric Definition of Data (DDD):

“According to GDI, information cannot be dataless but, in the simplest case, it can consist of a single datum. Now a datum is reducible to just a lack of uniformity (diaphora is the Greek word for “difference”), so a general definition of a datum is:

The Diaphoric Definition of Data (DDD): A datum is a putative fact regarding some difference or lack of uniformity within some context.

“Depending on philosophical inclinations, DDD can be applied at three levels: 1. data as diaphora de re, that is, as lacks of uniformity in the real world out there. There is no specific name for such “data in the wild”. A possible suggestion is to refer to them as dedomena (“data” in Greek; note that our word “data” comes from the Latin translation of a work by Euclid entitled Dedomena). Dedomena are not to be confused with environmental data (see section 1.7.1). They are pure data or proto-epistemic data, that is, data before they are epistemically interpreted. As “fractures in the fabric of being” they can only be posited as an external anchor of our information, for dedomena are never accessed or elaborated independently of a level of abstraction (more on this in section 3.2.2). They can be reconstructed as ontological requirements, like Kant's noumena or Locke's substance: they are not epistemically experienced but their presence is empirically inferred from (and required by) experience. Of course, no example can be provided, but dedomena are whatever lack of uniformity in the world is the source of (what looks to information systems like us as) data, e.g., a red light against a dark background. Note that the point here is not to argue for the existence of such pure data in the wild, but to provide a distinction that (in section 1.6) will help to clarify why some philosophers have been able to accept the thesis that there can be no information without data representation while rejecting the thesis that information requires physical implementation; …”

The Diaphoric Definition of Data (DDD): (cont.)

“2. data as diaphora de signo, that is, lacks of uniformity between (the perception of) at least two physical states, such as a higher or lower charge in a battery, a variable electrical signal in a telephone conversation, or the dot and the line in the Morse alphabet; and

3. data as diaphora de dicto, that is, lacks of uniformity between two symbols, for example the letters A and B in the Latin alphabet.”

Luciano Floridi <luciano.floridi@philosophy.ox.ac.uk> “Semantic Conceptions of Information” (First published Wed Oct 5, 2005) Stanford Encyclopedia of Philosophy http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]

“Evidence”?

“Data having probative value and authority”? i.e., well supported by scientific logic and considered technically valid

Policy Formation and Decision Making

Political Power and Knowledge

Responsibility and Power

Politicians

Administrators or Managers

Technical Analysts

Scientists

Knowledge (in Western scientific terms)

Low

High

High

(Sutton, 1999)

From: Organizaciones que aprenden, paises que aprenden: lecciones y AP en Costa Rica, by Andrea Ballestero, Director, ELAP

???

Wednesday, January 21st, 2009 at 12:00 am

MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES

SUBJECT: Freedom of Information Act

A democracy requires accountability, and accountability requires transparency. As Justice Louis Brandeis wrote, "sunlight is said to be the best of disinfectants." In our democracy, the Freedom of Information Act (FOIA), which encourages accountability through transparency, is the most prominent expression of a profound national commitment to ensuring an open Government. At the heart of that commitment is the idea that accountability is in the interest of the Government and the citizenry alike. The Freedom of Information Act should be administered with a clear presumption: In the face of doubt, openness prevails. The Government should not keep information confidential merely because public officials might be embarrassed by disclosure, because errors and failures might be revealed, or because of speculative or abstract fears. Nondisclosure should never be based on an effort to protect the personal interests of Government officials at the expense of those they are supposed to serve. In responding to requests under the FOIA, executive branch agencies (agencies) should act promptly and in a spirit of cooperation, recognizing that such agencies are servants of the public. All agencies should adopt a presumption in favor of disclosure, in order to renew their commitment to the principles embodied in FOIA, and to usher in a new era of open Government. The presumption of disclosure should be applied to all decisions involving FOIA…[clip]

Barack Obama

http://www.whitehouse.gov/the_press_office/Freedom_of_Information_Act/

“Declaration of Scientific Principles” in “The Commonwealth of Science”

“7. The pursuit of scientific inquiry demands complete intellectual freedom and unrestricted international exchange of knowledge…”

from “The Commonwealth of Science ” Nature No.3753 October 4, 1941.

August 4, 2009: the White House issued a memorandum stating unequivocally “Sound science should inform policy decisions”

“Science and Technology Priorities for the FY2011 Budget,” PR Orszag and JP Holdren August 4, 2009, Memorandum for the Heads of Executive Departments and Agencies, M-09-27. http://www.whitehouse.gov/omb/assets/memoranda_fy2009/m09-27.pdf

The $3.6 billion Large Hadron Collider (LHC) will sample and record the results of up to 600 million proton collisions per second, producing roughly 15 petabytes (15 million gigabytes) of data annually in search of new fundamental particles. To allow thousands of scientists from around the globe to collaborate on the analysis of these data over the next 15 years (the estimated lifetime of the LHC), tens of thousands of computers located around the world are being harnessed in a distributed computing network called the Grid. Within the Grid, described as the most powerful supercomputer system in the world, the avalanche of data will be analyzed, shared, re-purposed and combined in innovative new ways designed to reveal the secrets of the fundamental properties of matter.

LHC source: http://public.web.cern.ch/public/en/LHC/LHC-en.html
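To put the LHC volumes in perspective, the quoted numbers can be turned into a lifetime total and an average data rate. This is a rough sketch only: it assumes decimal petabytes and assumes the 15 petabytes are spread evenly over a 365-day year, which ignores how the accelerator actually runs. The variable names are hypothetical.

```python
# Rough scale estimate from the LHC figures quoted above.
# Assumes decimal units and data spread evenly over the year.
PETABYTE = 10**15                      # bytes
MEGABYTE = 10**6                       # bytes
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

annual_bytes = 15 * PETABYTE           # "roughly 15 petabytes ... annually"
lifetime_years = 15                    # "the estimated lifetime of the LHC"

lifetime_petabytes = annual_bytes * lifetime_years // PETABYTE
average_rate_mb_per_s = annual_bytes / SECONDS_PER_YEAR / MEGABYTE

print(f"{lifetime_petabytes} petabytes over {lifetime_years} years")
print(f"about {average_rate_mb_per_s:.0f} megabytes per second on average")
```

Under these assumptions the archive would reach about 225 petabytes over the instrument's lifetime, with a sustained average of several hundred megabytes per second.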

“The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los Alamos Science and Technology Magazine August 2008 http://www.lanl.gov/news/1663/images/aug08/22lg.jpg

The (US) NCAR Research Data Archive (RDA)

“The NCAR Research Data Archive (RDA) is a comparatively small (currently 246 TB, less than 5% of the MSS [Mass Storage System] total size), but very important, part of the MSS stored data. The RDA has been curated by the staff in the Computational and Information Systems Laboratory for over 40 years, [emphasis added] and as such contains reference datasets used by large numbers of scientists. The RDA contents are long-term atmospheric (surface and upper air) and oceanographic observations, grid analyses of observational datasets, operational weather prediction model output, reanalyses, satellite derived datasets, and ancillary datasets, such as topography/bathymetry, vegetation, and land use. The RDA is not a static collection; it is now over 580 datasets with about 100 routinely updated and 10-20 new ones added each year.”

C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008, page 5.

www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]

NCAR Research Data Archive (RDA)

C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008 , page 7.

www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]

http://www.ncdc.noaa.gov/img/climate/globalwarming/ar4-fig-3-9.gif

Facebook?

Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos. (A petabyte is about 1,000 times as large as a terabyte, and could store about 500 billion pages of text.)

Training to Climb an Everest of Digital Data, by Ashlee Vance. NYT. Published: October 11, 2009

http://www.nytimes.com/2009/10/12/technology/12data.html?_r=1
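The Facebook figures imply some per-item sizes worth noting. The sketch below simply divides the quoted numbers, assuming a decimal petabyte; since the article says "more than 1 petabyte", the per-photo result is a lower bound rather than a measured average. The names are hypothetical.

```python
# Per-item sizes implied by the figures quoted above (decimal petabyte assumed).
PETABYTE = 10**15                  # bytes

photos = 40 * 10**9                # "40 billion photos"
pages_per_petabyte = 500 * 10**9   # "about 500 billion pages of text"

bytes_per_photo = PETABYTE // photos              # lower bound per photo
bytes_per_page = PETABYTE // pages_per_petabyte   # implied size of one page

print(f"at least {bytes_per_photo} bytes per photo")
print(f"about {bytes_per_page} bytes per page of text")
```

On these numbers a photo averages at least 25,000 bytes and a page of text is taken to be about 2,000 bytes.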

“Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.” http://en.wikipedia.org/wiki/Cavendish_experiment

http://www.newscientist.com/articleimages/mg12016390.100/0-four-fundamental-forces.html

“Experiments to determine the density of the earth,” by Henry Cavendish, Esq., F.R.S. and A.S. Read June 21, 1798 (From the Philosophical Transactions of the Royal Society of London for the year 1798, Part II, pp. 469-526)

From: http://www.archive.org/details/lawsofgravitatio00mackrich

DATASETS

some examples

with “native metadata”

2-d_soil_temps.csv
surface, and sub-surface soil temperatures (at 2cm and 8cm depths) measured at one location for a few days in order to calibrate a model of temperature propagation. Surface temperature was measured with an infrared thermometer, subsurface temperatures with a thermocouple.
----------------------------
5-minute_light_data_for_4_continuous_days_plus_reference.xls
PPF (photosynthetic photon flux = photosynthetically active radiation 400-700nm) measured with an array of photodiodes calibrated to a Licor sensor, along a linear transect for a few days. used to get an idea of how much light plants along the transect are receiving.
----------------------------
CO2_of_air_at_different_heights_July_9.xls
concentration of CO2 in the air during the evening for one day, measured with a Licor infrared gas analyzer and a series of relays and tubes with a pump. used to examine the gradient of CO2 coming from the soil when the air is still during the evening.
----------------------------
Fern_light_response.xls
Light response curves for bracken ferns, measured with a Licor photosynthesis system. Fronds are exposed to different light levels and their instantaneous photosynthesis and conductance are measured. used in conjunction with the induction data (below) for physiological characterization of the ferns.
----------------------------
La_Selva_species_photosyntheis_table.xls
incomplete data set on instantaneous photosynthesis rates for various tropical understory and epiphytic species grown in a shade house in Costa Rica.
----------------------------
manzanita_sapflow_12-5-07_to_7-7-08.xls
instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root growth and CO2 production.
----------------------------
moisture_release_curves.xls
percentage of water content, water potential (in MegaPascals) and temperature of soil samples, measured in the laboratory for calibration of water content with water potential. soil is from the James Reserve in California.
----------------------------
Photosynthetic_induction.xls
a time-course of photosynthetic induction for a leaf over 35 minutes. instantaneous photosynthesis measured as mol CO2/m^2/s and light level is probably 1000 micromoles. used to determine physiological characteristics of bracken ferns.
----------------------------
run_2_24-h_data_for_mesh.xls
measurements of micrometeorological parameters on a moving shuttle, going from a clearing across a forest edge and into the forest for about 30 meters. Pyranometers facing up and down, pyrgeometer facing up and down, PAR, air temperature, relative humidity. Also data from a station fixed in the clearing and some derived variables calculated. used for examining edge effects in forests.
----------------------------
Segment_of_wallflower_compare_colorspaces_blur.xls
pixel counts from images of wallflowers that were segmented into flower/not-flower under different color spaces. segmentation was made using a probability matrix of hand-segmented images. used to automatically count flowers in images collected after this training data was collected (and used to determine the best color space for this task).

sbid  battery  datetime    heater_voltage  Manz1Sap1   Manz1Sap2   Manz1Sap3   Manz1Sap4   Manz2Sap5   Manz2Sap6   Manz2Sap7   Manz3Sap10  Manz3Sap8   Manz3Sap9   Manz4Sap11  timestamp      Datagap  Julian
2     12.365   1196796112  2018.8          0.5585      0.51029     0.55517     0.54354     0.6067      0.52858     0.55351     0.59008     0.59506     0.60337     0.56514     12/4/07 11:21           4.47351
3     12.348   1196796232  2017.9          0.55682     0.51028     0.5535      0.54352     0.60669     0.52857     0.55017     0.59007     0.59505     0.60336     0.56513     12/4/07 11:23  0        4.47490
4     12.357   1196796352  2018.6          0.55514     0.51027     0.55348     0.54351     0.60501     0.52855     0.55016     0.59005     0.59504     0.60501     0.56512     12/4/07 11:25  0        4.47628
5     12.354   1196796472  2017.6          0.55514     0.51026     0.55181     0.5435      0.60334     0.52855     0.54849     0.59004     0.59503     0.60334     0.56511     12/4/07 11:27  0        4.47767
6     12.334   1196796592  2018.3          0.55347     0.51026     0.55015     0.5435      0.60333     0.52854     0.54682     0.59004     0.59502     0.605       0.56511     12/4/07 11:29  0        4.47906
7     12.34    1196796712  2018.5          0.55014     0.50859     0.55014     0.54349     0.60332     0.53019     0.54349     0.59003     0.59501     0.60498     0.56676     12/4/07 11:31  0        4.48045
8     12.337   1196796832  2017.8          0.55013     0.50692     0.55013     0.54348     0.60332     0.53019     0.54182     0.59002     0.59501     0.60498     0.56675     12/4/07 11:33  0        4.48184
9     12.328   1196796952  2017.5          0.5468      0.50691     0.5468      0.54347     0.60331     0.53018     0.53849     0.59001     0.595       0.60497     0.56674     12/4/07 11:35  0        4.48323
10    12.323   1196797072  2017            0.54679     0.50524     0.54679     0.54347     0.59998     0.53017     0.53682     0.59        0.59499     0.60496     0.56674     12/4/07 11:37  0        4.48462
11    12.328   1196797192  2018.9          0.54679     0.50191     0.54512     0.5418      0.59665     0.53017     0.53349     0.59        0.59498     0.60496     0.56673     12/4/07 11:39  0        4.48601
12    12.319   1196797312  2017.7          0.54345     0.49857     0.54178     0.54178     0.59663     0.53015     0.53015     0.58998     0.5933      0.60327     0.56671     12/4/07 11:41  0        4.48740
13    12.311   1196797432  2017.3          0.54343     0.4969      0.54011     0.54177     0.59661     0.53014     0.52848     0.58997     0.59329     0.6016      0.5667      12/4/07 11:43  0        4.48878
14    12.316   1196797552  2018.6          0.5401      0.49357     0.53678     0.54176     0.59328     0.53013     0.5268      0.58995     0.59328     0.60325     0.56669     12/4/07 11:45  0        4.49017
15    12.31    1196797672  2016.8          0.53844     0.4919      0.53511     0.54176     0.59494     0.53013     0.52514     0.58995     0.59328     0.60325     0.56503     12/4/07 11:47  0        4.49156
16    12.31    1196797792  2017.1          0.53676     0.48856     0.53343     0.54174     0.59326     0.53011     0.5218      0.58993     0.59326     0.60323     0.56501     12/4/07 11:49  0        4.49295
17    12.31    1196797912  2017.1          0.53342     0.48523     0.5301      0.54173     0.59324     0.5301      0.51846     0.58826     0.59324     0.60321     0.56499     12/4/07 11:51  0        4.49434
18    12.301   1196798031  2017.5          0.53174     0.48521     0.52842     0.53839     0.59156     0.53008     0.51845     0.58824     0.59323     0.6032      0.56498     12/4/07 11:53  0        4.49573
19    12.301   1196798151  2016.3          0.53007     0.48188     0.52509     0.53838     0.59155     0.53007     0.51512     0.58823     0.59321     0.60152     0.5633      12/4/07 11:55  0        4.49712
20    12.303   1196798271  2016.6          0.5284      0.47855     0.52175     0.53837     0.59154     0.5284      0.5151      0.58821     0.59154     0.60151     0.56163     12/4/07 11:57  0        4.49851

manzanita_sapflow_12-5-07_to_7-7-08.xls
instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root growth and CO2 production.

Datum: “0.59998”
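A single datum such as “0.59998” only becomes usable once its row and column context is recovered. The sketch below parses the one row of the sap flow table that contains it and checks that three of the time-related columns agree with one another. Several things here are assumptions rather than documented facts about the file: that the datetime column is a Unix epoch in seconds, that the timestamp column is local time at UTC-8, and that the Julian column is day of month plus fraction of the day. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Column names as given on the slide.
HEADER = (
    "sbid battery datetime heater_voltage Manz1Sap1 Manz1Sap2 Manz1Sap3 "
    "Manz1Sap4 Manz2Sap5 Manz2Sap6 Manz2Sap7 Manz3Sap10 Manz3Sap8 "
    "Manz3Sap9 Manz4Sap11 timestamp Datagap Julian"
).split()

# The row (sbid 10) that contains the datum 0.59998.
ROW_10 = (
    "10 12.323 1196797072 2017 0.54679 0.50524 0.54679 0.54347 0.59998 "
    "0.53017 0.53682 0.59 0.59499 0.60496 0.56674 12/4/07 11:37 0 4.48462"
)

# Assumption: local time is UTC-8 (Pacific Standard Time).
LOCAL_TZ = timezone(timedelta(hours=-8))


def parse_row(line):
    """Split a row on whitespace and rejoin the two-token timestamp field."""
    tokens = line.split()
    head, tail = tokens[:15], tokens[15:]
    fields = head + [" ".join(tail[:2])] + tail[2:]
    return dict(zip(HEADER, fields))


record = parse_row(ROW_10)

# Assumption: the datetime column holds Unix epoch seconds.
local_time = datetime.fromtimestamp(int(record["datetime"]), tz=LOCAL_TZ)

# Assumption: "Julian" is day of month plus the elapsed fraction of the day.
seconds_into_day = (
    local_time.hour * 3600 + local_time.minute * 60 + local_time.second
)
day_fraction = local_time.day + seconds_into_day / 86400

print(record["Manz2Sap5"], "is the Manz2Sap5 reading at", local_time.isoformat())
print("recomputed day fraction:", round(day_fraction, 5),
      "recorded:", record["Julian"])
```

If the three assumptions hold, the epoch value, the human-readable timestamp, and the day fraction all describe the same moment, which is the kind of internal consistency check that "native metadata" makes possible.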

“Jim Gray on eScience: A Transformed Scientific Method,” T. Hey, S. Tansley, and K. Tolle (eds.), Microsoft Research. Based on the transcript of a talk given by Jim Gray to the NRC-CSTB in Mountain View, CA, on January 11, 2007. http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf

“Reanalyses” [or Meta-Analyses ]

“Atmospheric reanalyses are a main feature within the RDA and were intended to be, and have become, a very valuable data resource for a wide variety of climate and weather studies. By combining many types of atmospheric observations with advanced data assimilation and forecast models a “best possible” 3D estimate of the atmospheric state over extended time periods is achieved.

“Reanalyses are supported by many historical data sources that have been curated over time. As an illustration the major sources of atmospheric profile data include wind only soundings beginning in 1920 (Figure 2). These are augmented with soundings of temperature, humidity, and wind beginning in 1948.”

C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008, page 6.

www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
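The idea of combining observations with a forecast model to reach a "best possible" estimate can be illustrated in its simplest form: a single value, one forecast, one observation, each with an error variance. The sketch below uses inverse-variance weighting for that scalar case. It is not the assimilation scheme used for the RDA reanalyses, which the paper does not detail here, and the temperatures and variances are made-up numbers.

```python
# Scalar illustration of blending a model forecast with an observation.
# Not the RDA's method; all numbers below are hypothetical.


def blend(forecast, forecast_var, observation, observation_var):
    """Weight forecast and observation by their error variances."""
    gain = forecast_var / (forecast_var + observation_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1 - gain) * forecast_var
    return analysis, analysis_var


# Hypothetical case: the model says 15.0 degrees with variance 4.0,
# a sounding says 13.0 degrees with variance 1.0.
analysis, analysis_var = blend(15.0, 4.0, 13.0, 1.0)
print(analysis, analysis_var)
```

The estimate is pulled toward whichever source has the smaller error variance, and its own variance is smaller than either input's. That is why well-curated historical observations add value even where a model already exists.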

Fundamental Questions:

• Data Specification – scientific logic of data definition
• Data Creation – specification of methodology
• Data Integrity – preservation -- “chain of custody”

“Chain of custody refers to the chronological documentation or paper trail, showing the seizure, custody, control, transfer, analysis, and disposition of evidence, physical or electronic.”

[http://en.wikipedia.org/wiki/Chain_of_custody, clipped 11/12/09 10:30pm PST]

• Data transformations
– Logic
– Competence / Technical Performance / Execution
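One way to picture a "chain of custody" for a dataset is an append-only log in which every entry records who did what to which version of the data and also carries a hash of the previous entry, so that later tampering with the record becomes detectable. The sketch below is a minimal illustration using Python's standard hashlib and json modules. The field names, actors, and digests are hypothetical, and a real system would also need timestamps, signatures, and secure storage.

```python
import hashlib
import json


def _digest(entry_body):
    """SHA-256 over a canonical JSON encoding of an entry."""
    encoded = json.dumps(entry_body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def add_event(log, actor, action, data_digest):
    """Append a custody event that is linked to the previous entry's hash."""
    body = {
        "seq": len(log),
        "actor": actor,
        "action": action,
        "data_digest": data_digest,
        "prev_hash": log[-1]["entry_hash"] if log else "",
    }
    entry = dict(body, entry_hash=_digest(body))
    log.append(entry)
    return entry


def verify(log):
    """Check sequence numbers, links, and hashes of every entry."""
    prev_hash = ""
    for index, entry in enumerate(log):
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["seq"] != index or entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != _digest(body):
            return False
        prev_hash = entry["entry_hash"]
    return True


# Hypothetical history of one dataset.
custody_log = []
raw = hashlib.sha256(b"raw sensor readings").hexdigest()
cleaned = hashlib.sha256(b"cleaned sensor readings").hexdigest()
add_event(custody_log, "field datalogger", "collected", raw)
add_event(custody_log, "analyst", "transformed", cleaned)
print(verify(custody_log))

custody_log[0]["actor"] = "someone else"   # tamper with the record
print(verify(custody_log))
```

The design choice here is that each entry covers the previous entry's hash, so changing any earlier step invalidates the record from that point on. Documenting the logic of each transformation remains a separate, human task.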

“Keeping Raw Data in Context”

“…any initiative to share raw clinical research data must also pay close attention to sharing clear and complete information about the design of the original studies. Relying on journal articles for study design information is problematic, for three reasons. First, journal articles often provide insufficient detail when describing key study design features such as randomization (1) and intervention details (2). Second, some data sets may come from studies with no publications [only 21% of oncology trials registered in ClinicalTrials.gov before 2004 and completed by September 2007 were published (3)]. Finally, investigators cannot reliably search journal articles for methodological concepts like “double blinding” or “interrupted time series,” crucial concepts for proper interpretation of the data. A mishmash of non-standardized databases of raw results and unevenly reported study designs is not a strong foundation for clinical research data sharing.”

“We believe that the effective sharing of clinical research data requires the establishment of an interoperable federated database system that includes both study design and results data. A key component of this system is a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.”

I Sim, et al. “Keeping Raw Data in Context”[letter] Science v 323 6 Feb 2009, p713.

“Increasing levels of coordinate digit noise associated with repeated projection transformations”

Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.

“A careful look at the coordinate digits stored as double precision variables in a GIS yields a variety of interesting patterns that are a result of previous machine error, rounding error, measurement error, and so forth. Any slight cartographic alteration (rotation/skewing, clipping/sub-setting, reprojecting, etc.) can add noise into the coordinate and can be used to characterize a vector dataset.”

“It is well known that cartographic coordinates stored in double precision are far more precisely specified than is merited by their accuracy, even for highly-accurate global datasets. Far more coordinate digit places are stored for the sake of avoiding machine error than are needed to define the location of map objects within the necessary tolerances for both absolute and relative accuracies.”

Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.
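The effect described above can be observed directly by projecting a coordinate and inverting the projection many times in double precision. The sketch below uses a spherical Mercator projection as a stand-in; the paper's own transformations and test data are not reproduced here. The sphere radius and the starting coordinate are assumptions chosen for illustration.

```python
import math

# Assumption: spherical Mercator on a sphere of this radius (meters).
RADIUS = 6378137.0


def forward(lon_deg, lat_deg):
    """Project geographic degrees to spherical Mercator meters."""
    x = RADIUS * math.radians(lon_deg)
    y = RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def inverse(x, y):
    """Convert spherical Mercator meters back to geographic degrees."""
    lon_deg = math.degrees(x / RADIUS)
    lat_deg = math.degrees(2 * math.atan(math.exp(y / RADIUS)) - math.pi / 2)
    return lon_deg, lat_deg


def round_trips(lon_deg, lat_deg, count):
    """Apply forward then inverse projection `count` times."""
    for _ in range(count):
        lon_deg, lat_deg = inverse(*forward(lon_deg, lat_deg))
    return lon_deg, lat_deg


# Hypothetical starting point, given to four decimal places.
start = (-122.2585, 37.8719)
end = round_trips(*start, 1000)
drift = (abs(end[0] - start[0]), abs(end[1] - start[1]))

print("start:", repr(start[0]), repr(start[1]))
print("end:  ", repr(end[0]), repr(end[1]))
print("drift in degrees:", drift)
```

Any drift from the repeated transformations sits many orders of magnitude below the four decimal places the coordinate was originally given to, yet the stored digits may no longer match the input exactly. That gap between stored precision and meaningful accuracy is what the authors propose to read as a trace of a dataset's processing history.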

Individual Libraries

Cooperative Projects

National Disciplinary Initiatives

“BIG Science”

“Small Science”

Local / Personal Archiving

International Collaborative Research Effort

Individuals

Data Centers

GRIDS

“Small Science”?

The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight states of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus disaggregated components of an incipient network that is only as effective as the individual transactions that put it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable.

The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain Office of International Scientific and Technical Information Programs Board on International Scientific Organizations Policy and Global Affairs Division, National Research Council of the National Academies, p. 8

http://www.loc.gov/exhibits/dres/dre123.jpg

Maria Sibylla Merian Metamorphosis insectorum Surinamensium (Metamorphosis of the Insects of Surinam) Amsterdam, 1705, figure 46 Hand-colored engraving (123)

http://www.nyu.edu/projects/materialworld/images/1_Darwin%20Tree%20B%2036.jpg

http://darwin-online.org.uk/converted/published/1975_NaturalSelection_F1583/1975_NaturalSelection_F1583_fig03.jpg

DARWIN

http://diglib1.amnh.org/cgi-bin/database/index.cgi

FIELD NOTES FROM THE AMERICAN MUSEUM CONGO EXPEDITION 1909-1915

http://diglib1.amnh.org/galleries/bats/taphozous_mauritianus.html

Rheinardia ocellata, the Crested Argus. Photographed at night by an automatic camera-trap in the Ngoc Linh foothills (Quang Nam Province).

Courtesy AMNH Center for Biodiversity and Conservation

http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14

By Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009 SEE:

http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse

How many data sources contributed to this analysis?

The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independentlycollected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardizedheterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight exist in various twilight states of accessibilitystates of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus data are thus disaggregated components of an incipient network that disaggregated components of an incipient network that is only as effective as the individual transactions that is only as effective as the individual transactions that put it togetherput it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable.

The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain Office of International Scientific and Technical Information Programs Board on International Scientific Organizations Policy and Global Affairs Division, National Research Council of the National Academies, p. 8

Individual Libraries

Cooperative Projects

National Disciplinary Initiatives

“BIG Science”

“Small Science”

Local / Personal Archiving

International Collaborative Research Effort

Individuals

Data Centers

GRIDS

Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing. doi: 10.1787/603233448430 http://dx.doi.org/10.1787/603233448430

http://ocde.p4.siteinternet.com/publications/doifiles/publishing-standards-data-2009.pdf


What does “Full Life-Cycle” Data Management Mean ?

US NSF “DataNet” Program: “the full data preservation and access lifecycle”

• “acquisition”
• “documentation”
• “protection”
• “access”
• “analysis and dissemination”
• “migration”
• “disposition”

“Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation” NSF 07-601 US National Science Foundation Office of Cyberinfrastructure Directorate for Computer & Information Science & Engineering
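One way to make the seven DataNet stages concrete is to treat them as a checklist against which a data management plan can be compared. The sketch below is only illustrative: the stage names come from the solicitation text quoted above, but the ordering, the `LifecycleStage` and `missing_stages` names, and the sample plan are assumptions introduced here, not part of NSF 07-601.

```python
from enum import IntEnum


class LifecycleStage(IntEnum):
    """Stage names quoted from NSF 07-601; the numeric order is an assumption."""
    ACQUISITION = 1
    DOCUMENTATION = 2
    PROTECTION = 3
    ACCESS = 4
    ANALYSIS_AND_DISSEMINATION = 5
    MIGRATION = 6
    DISPOSITION = 7


def missing_stages(planned):
    """Return the lifecycle stages that a (hypothetical) plan does not cover."""
    covered = set(planned)
    return [stage for stage in LifecycleStage if stage not in covered]


# Hypothetical plan that addresses only acquisition and access.
sample_plan = [LifecycleStage.ACQUISITION, LifecycleStage.ACCESS]
print([stage.name for stage in missing_stages(sample_plan)])
```

A plan that covers only acquisition and access leaves five of the seven stages unaddressed, which is the gap that “full life-cycle” management is meant to close.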

Incentives?

How do we Incentivize Change ?

• Individuals
• Professions / Disciplines
• Organizations
• Institutions (Universities, Research Institutes, Museums, Gardens, Herbaria, Aquariums, Zoos)
• “Memory Institutions” (Libraries, Archives)
• Governments
• Funders / Sponsors
• Publishers!

Individual’s willingness to share: the Core functions of Scholarly Communication

• “Registration, which allows claims of precedence for a scholarly finding.

• “Certification, which establishes the validity of a registered scholarly claim.

• “Awareness, which allows participants in the scholarly system to remain aware of new claims and findings.

• “Archiving, which preserves the scholarly record over time.

• “Rewarding, which rewards participants for their performance in the communication system based on metrics derived from that system.”

Roosendaal, H., Geurts, P in Cooperative Research Information Systems in Physics (Oldenburg, Germany, 1997).

Professional / Disciplinary Incentives?

• Norms and standards for sharing vary by discipline
• In “big science” (astrophysics / astronomy / meteorology / oceanography / genomics) sharing is expected (if not required) and contributions to a common fund of knowledge are assumed (See also: GENBANK)
– Standards are relatively clear
– Mechanisms for sharing are well-developed
– Collective / collaborative authorship is commonplace

• In “small science” such norms are weaker

Small Science: Data Deposit and Access

• Data are typically held in many formats
• Discovery of data is very weakly supported by standards-development
• Access to and use of data are highly variable
• [However, progress has been made respecting museum specimen data in the past 20 years (SEE, for example, GBIF and many allied projects)]

• Some progress has been made respecting observational and other data

• Ecological and conservation field data remain highly problematic

Some suggestions for action include:

• government agencies and private foundations must both set strict requirements for effective sharing – with serious penalties (such as disqualification for future research funding) for failures to share;

• peer review processes must include rigorous scrutiny of past histories of sharing and must require state-of-the-art planning for sharing (not simply a promise to “put data up on the Web”);

• negotiations for “overhead” (“indirect costs”) compensation from funders must include examination of digital infrastructure adequate for sharing and maintenance of data;

• accreditation bodies for educational institutions and museums must start to require demonstrated evidence of capacity to support digital access and maintenance of data;

• professional societies and professional disciplines must begin to require evidence of effective sharing of data in evaluating credentials for hiring, promotion and tenure;

http://www.mikero.com/blog/2009/02/20/more-darwin

http://www.zazzle.com/darwin2009

From: Tom Moritz [mailto:tom.moritz@gmail.com]
Sent: Thursday, November 12, 2009 1:46 AM
To: Donat Agosti
Subject: Snapple Real Fact #134: “An ant can lift 50 times its own weight.”

Is this true?
Tom
________________________________________________
From: Donat Agosti <agosti@amnh.org>
Date: Wed, Nov 11, 2009 at 8:03 PM
Subject: RE: Snapple Real Fact #134: “An ant can lift 50 times its own weight.”
To: Tom Moritz tom.moritz@gmail.com

People says so [emphasis added] – but we once looked for the evidence, but could not find a scientific paper confirming this.

D

"They [the hippopotami] present the following appearance; four-footed, with cloven hooves like cattle; blunt-nosed; with a horse's mane, visible tusks, a horse's tail and voice; big as the biggest bull. Their hide is so thick that, when it is dried, spearshafts are made of it.” Herodotus, The Histories (with an English translation by A. D. Godley). Cambridge. Harvard University Press. 1920. LXXI

http://old.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus%3Aabo%3Atlg%2C0016%2C001&query=2%3A71%3A1 [clipped 11/12/09]

Iobi Ludolfi aliàs Leut-holf dicti Historia Æthiopica, sive Brevis & succincta descriptio regni Habessinorum, quod vulgò malè Presbyteri Iohannis vocatur : 2009 Cambridge University Library

“…the great trouble with the world was that which survived was held in hard evidence as to past events. A false authority clung to what persisted, as if those artifacts of the past which had endured had done so by some act of their own will.”

-- Cormac McCarthy, The Crossing

a problem with “evidence”…

“Πάντα ῥεῖ καὶ οὐδὲν μένει” Heraclitus: “Everything flows, nothing stands still.”

All data is dynamic

From examination of elephants’ skulls the early Greeks deduced that a species of humanoid Cyclops existed…

(SEE -- for example -- The Odyssey and Ulysses’ encounter with Polyphemus on the island of Sicily…)

http://www.amnh.org/exhibitions/mythiccreatures/land/greek.php

Another deduction from the evidence of narwhal tusks…

“In the Middle Ages, narwhal tusks were widely thought to be unicorn horns with magical, curative properties. Indeed, cups made from narwhal tusks (above) were thought to neutralize poisons and were highly valued.”

http://www.amnh.org/exhibitions/mythiccreatures/land/unicorns.php

Kirtland’s Warbler / Abaco Island, The Bahamas

and 5 CALIFORNIA CONDORS !!!

DEAD HARBOR SEAL

“NATIVE” METADATA
