UC Berkeley iSchool Nov, 2009


Upload: tom-moritz


Page 1: UC Berkeley iSchool Nov, 2009

“Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.” http://en.wikipedia.org/wiki/Cavendish_experiment

Hinges and Loops? -- Data as Evidence

I-School, UC Berkeley, November 13, 2009

Page 2: UC Berkeley iSchool Nov, 2009

The Tragedy of Othello: The Moor of Venice (Act 3 Scene 3)

“Othello: ‘Villain, be sure thou prove my love a whore; Be sure of it; give me the ocular proof; Or, by the worth of man’s eternal soul, Thou hadst been better have been born a dog Than answer my waked wrath!’

Iago: ‘Is’t come to this?’

Othello: ‘Make me to see’t; or, at the least, so prove it, That the probation bear no hinge nor loop To hang a doubt on; or woe upon thy life!’”

Page 3: UC Berkeley iSchool Nov, 2009

“So the universe has always appeared to the natural mind as a kind of enigma, of which the key must be sought in the shape of some illuminating or power-bringing word or name. That word names the universe's principle, and to possess it is, after a fashion, to possess the universe itself. 'God,' 'Matter,' 'Reason,' 'the Absolute,' 'Energy,' are so many solving names.

You can rest when you have them. You are at the end of your metaphysical quest.”

William James. "What Pragmatism Means". Lecture 2 in Pragmatism: A new name for some old ways of thinking. New York: Longmans, Green and Co. (1922): 52.

http://www.archive.org/stream/pragmatismnewnam00jame

Page 4: UC Berkeley iSchool Nov, 2009

Internet Archive: http://www.archive.org/stream/pragmatismnewnam00jame

Note Date of Publication: 1922

Page 5: UC Berkeley iSchool Nov, 2009

Clear definitions are good (!)

We should not reflexively rely on metaphysical “solving” / “power-bringing” words…

ADD to James’s list?: “Knowledge” “Information” “Data” ???

Page 6: UC Berkeley iSchool Nov, 2009
Page 7: UC Berkeley iSchool Nov, 2009

“Data” ?

Page 8: UC Berkeley iSchool Nov, 2009

Usage

Data: The word data is the Latin plural of datum, neuter past participle of dare, "to give", hence "something given".

“Data leads a life of its own quite independent of datum, of which it was originally the plural. It occurs in two constructions: as a plural noun (like earnings), taking a plural verb and plural modifiers (as these, many, a few) but not cardinal numbers, and serving as a referent for plural pronouns; and as an abstract mass noun (like information), taking a singular verb and singular modifiers (as this, much, little), and being referred to by a singular pronoun. Both constructions are standard. The plural construction is more common in print, perhaps because the house style of some publishers mandates it.”

The Merriam-Webster Online Dictionary http://www.merriam-webster.com/dictionary/data

Page 9: UC Berkeley iSchool Nov, 2009

“Data” ? [technological]

“…’data’ are defined as any information that can be stored in digital form and accessed electronically, including, but not limited to, numeric data, text, publications, sensor streams, video, audio, algorithms, software, models and simulations, images, etc.” -- Program Solicitation 07-601 “Sustainable Digital Data Preservation and Access Network Partners (DataNet)”

Taken in this broadest possible sense, “data” are thus simply electronically coded forms of information, and virtually anything can be represented as “data” so long as it is machine-readable.

Page 10: UC Berkeley iSchool Nov, 2009

“The digital universe in 2007 — at 2.25 x 10^21 bits (281 exabytes or 281 billion gigabytes) — was 10% bigger than we thought. The resizing comes as a result of faster growth in cameras, digital TV shipments, and better understanding of information replication.

“By 2011, the digital universe will be 10 times the size it was in 2006.

“As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.

“Fast-growing corners of the digital universe include those related to digital TV, surveillance cameras, Internet access in emerging countries, sensor-based applications, datacenters supporting “cloud computing,” and social networks.

The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
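The unit equivalence quoted by IDC (2.25 x 10^21 bits, 281 exabytes, 281 billion gigabytes) can be checked with a few lines of arithmetic. The following is a minimal sketch that assumes decimal (SI) units, i.e. 1 exabyte = 10^18 bytes and 1 gigabyte = 10^9 bytes; the report excerpt does not say which convention it uses, and the variable names are illustrative only.

```python
# Assumption: decimal (SI) units; the IDC excerpt does not state its convention.
BITS_PER_BYTE = 8
EXABYTE = 10**18   # bytes
GIGABYTE = 10**9   # bytes

# Size of the 2007 digital universe as quoted on the slide: 2.25 x 10^21 bits.
digital_universe_bits = 225 * 10**19

size_bytes = digital_universe_bits // BITS_PER_BYTE
size_exabytes = size_bytes / EXABYTE
size_gigabytes = size_bytes / GIGABYTE

print(f"{size_exabytes:.2f} exabytes")
print(f"{size_gigabytes / 1e9:.2f} billion gigabytes")
```

Under these assumptions the figure comes to 281.25 exabytes, which agrees with the rounded "281 exabytes" in the quotation.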

Page 11: UC Berkeley iSchool Nov, 2009

“The diversity of the digital universe can be seen in the variability of file sizes, from 6 gigabyte movies on DVD to 128-bit signals from RFID tags. Because of the growth of VoIP, sensors, and RFID, the number of electronic information “containers” — files, images, packets, tag contents — is growing 50% faster than the number of gigabytes. The information created in 2011 will be contained in more than 20 quadrillion — 20 million billion — of such containers, a tremendous management challenge for both businesses and consumers.”

The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf

Page 12: UC Berkeley iSchool Nov, 2009

“Data” [epistemic]

“Measurements, observations or descriptions of a referent -- such as an individual, an event, a specimen in a collection or an excavated/surveyed object -- created or collected through human interpretation (whether directly “by hand” or through the use of technologies)”

-- AnthroDPA Working Group on Metadata (May, 2009)

Page 13: UC Berkeley iSchool Nov, 2009

“The General Definition of Information (GDI)”

σ is an instance of information, understood as semantic content, if and only if:

• (GDI.1) σ consists of one or more data;
• (GDI.2) the data in σ are well-formed;
• (GDI.3) the well-formed data in σ are meaningful.

Luciano Floridi, “Semantic Conceptions of Information” (First published Wed Oct 5, 2005), Stanford Encyclopedia of Philosophy. http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]
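Read operationally, the GDI is a conjunction of three tests applied in order. The sketch below is one possible reading, not Floridi's formalization: the function `is_information` and the two toy checks (`parses_as_numbers` for well-formedness, `plausible_soil_temps` for meaningfulness) are hypothetical stand-ins chosen for illustration, and what counts as "well-formed" or "meaningful" would in practice depend on the syntax and semantics of the system in question.

```python
from typing import Callable, Sequence

Check = Callable[[Sequence[str]], bool]

def is_information(data: Sequence[str], well_formed: Check, meaningful: Check) -> bool:
    """Apply GDI.1 to GDI.3 in order; all three must hold."""
    if len(data) < 1:            # GDI.1: one or more data
        return False
    if not well_formed(data):    # GDI.2: the data are well-formed
        return False
    return meaningful(data)      # GDI.3: the well-formed data are meaningful

# Hypothetical syntax rule: every datum must parse as a number.
def parses_as_numbers(data: Sequence[str]) -> bool:
    try:
        for datum in data:
            float(datum)
    except ValueError:
        return False
    return True

# Hypothetical semantic rule: values must be plausible soil temperatures in Celsius.
def plausible_soil_temps(data: Sequence[str]) -> bool:
    return all(-50.0 <= float(datum) <= 70.0 for datum in data)

print(is_information(["12.5", "13.1"], parses_as_numbers, plausible_soil_temps))
print(is_information(["12.5", "n/a"], parses_as_numbers, plausible_soil_temps))
```

The ordering matters in this sketch: the meaningfulness check is only reached for data that already passed the well-formedness check, mirroring the wording of GDI.3.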

Page 14: UC Berkeley iSchool Nov, 2009

“…with the corollary assumptions that they are objective -- that is, not conditioned by subjective perspectives -- and invariant -- that is, true under all circumstances.”

-- Draft GBIF DPFTG Report, 2009

SEE: R. Nozick, Invariances: The Structure of the Objective World, Harvard University Press, Cambridge, 2001. AND L. Daston and P. Galison, Objectivity, Zone Books, NY, 2007.

Page 15: UC Berkeley iSchool Nov, 2009

The Diaphoric Definition of Data (DDD):

“According to GDI, information cannot be dataless but, in the simplest case, it can consist of a single datum. Now a datum is reducible to just a lack of uniformity (diaphora is the Greek word for “difference”), so a general definition of a datum is:

The Diaphoric Definition of Data (DDD): A datum is a putative fact regarding some difference or lack of uniformity within some context.

“Depending on philosophical inclinations, DDD can be applied at three levels:

1. data as diaphora de re, that is, as lacks of uniformity in the real world out there. There is no specific name for such “data in the wild”. A possible suggestion is to refer to them as dedomena (“data” in Greek; note that our word “data” comes from the Latin translation of a work by Euclid entitled Dedomena). Dedomena are not to be confused with environmental data (see section 1.7.1). They are pure data or proto-epistemic data, that is, data before they are epistemically interpreted. As “fractures in the fabric of being” they can only be posited as an external anchor of our information, for dedomena are never accessed or elaborated independently of a level of abstraction (more on this in section 3.2.2). They can be reconstructed as ontological requirements, like Kant's noumena or Locke's substance: they are not epistemically experienced but their presence is empirically inferred from (and required by) experience. Of course, no example can be provided, but dedomena are whatever lack of uniformity in the world is the source of (what looks to information systems like us as) data, e.g., a red light against a dark background. Note that the point here is not to argue for the existence of such pure data in the wild, but to provide a distinction that (in section 1.6) will help to clarify why some philosophers have been able to accept the thesis that there can be no information without data representation while rejecting the thesis that information requires physical implementation; …”

Page 16: UC Berkeley iSchool Nov, 2009

The Diaphoric Definition of Data (DDD): (cont.)

“2. data as diaphora de signo, that is, lacks of uniformity between (the perception of) at least two physical states, such as a higher or lower charge in a battery, a variable electrical signal in a telephone conversation, or the dot and the line in the Morse alphabet; and

3. data as diaphora de dicto, that is, lacks of uniformity between two symbols, for example the letters A and B in the Latin alphabet.”

Luciano Floridi, “Semantic Conceptions of Information” (First published Wed Oct 5, 2005), Stanford Encyclopedia of Philosophy. http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]

Page 17: UC Berkeley iSchool Nov, 2009

“Evidence”?

“Data having probative value and authority”?

i.e. well supported by scientific logic and considered technically valid

Page 18: UC Berkeley iSchool Nov, 2009

Policy Formation and Decision Making

Page 19: UC Berkeley iSchool Nov, 2009

Political Power and Knowledge

[Diagram: groups of actors placed against two axes, “Responsibility and Power” and “Knowledge (in Western scientific terms)”, with axis ends labeled Low and High.]

Politicians

Administrators or Managers

Analysts / Technicians

Scientists

(Sutton, 1999)

From: Organizaciones que aprenden, países que aprenden: lecciones y AP en Costa Rica, by Andrea Ballestero, Director, ELAP

???

Page 20: UC Berkeley iSchool Nov, 2009

Wednesday, January 21st, 2009 at 12:00 am

MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES

SUBJECT: Freedom of Information Act

A democracy requires accountability, and accountability requires transparency. As Justice Louis Brandeis wrote, "sunlight is said to be the best of disinfectants." In our democracy, the Freedom of Information Act (FOIA), which encourages accountability through transparency, is the most prominent expression of a profound national commitment to ensuring an open Government. At the heart of that commitment is the idea that accountability is in the interest of the Government and the citizenry alike.

The Freedom of Information Act should be administered with a clear presumption: In the face of doubt, openness prevails. The Government should not keep information confidential merely because public officials might be embarrassed by disclosure, because errors and failures might be revealed, or because of speculative or abstract fears. Nondisclosure should never be based on an effort to protect the personal interests of Government officials at the expense of those they are supposed to serve. In responding to requests under the FOIA, executive branch agencies (agencies) should act promptly and in a spirit of cooperation, recognizing that such agencies are servants of the public.

All agencies should adopt a presumption in favor of disclosure, in order to renew their commitment to the principles embodied in FOIA, and to usher in a new era of open Government. The presumption of disclosure should be applied to all decisions involving FOIA…[clip]

Barack Obama

http://www.whitehouse.gov/the_press_office/Freedom_of_Information_Act/

Page 21: UC Berkeley iSchool Nov, 2009

“Declaration of Scientific Principles” in “The Commonwealth of Science”

“7. The pursuit of scientific inquiry demands complete intellectual freedom and unrestricted international exchange of knowledge…”

from “The Commonwealth of Science ” Nature No.3753 October 4, 1941.

Page 22: UC Berkeley iSchool Nov, 2009

August 4, 2009: the White House issued a memorandum stating unequivocally “Sound science should inform policy decisions”

“Science and Technology Priorities for the FY2011 Budget,” PR Orszag and JP Holdren August 4, 2009, Memorandum for the Heads of Executive Departments and Agencies, M-09-27. http://www.whitehouse.gov/omb/assets/memoranda_fy2009/m09-27.pdf

Page 23: UC Berkeley iSchool Nov, 2009

The $3.6 billion Large Hadron Collider (LHC) will sample and record the results of up to 600 million proton collisions per second, producing roughly 15 petabytes (15 million gigabytes) of data annually in search of new fundamental particles. To allow thousands of scientists from around the globe to collaborate on the analysis of these data over the next 15 years (the estimated lifetime of the LHC), tens of thousands of computers located around the world are being harnessed in a distributed computing network called the Grid. Within the Grid, described as the most powerful supercomputer system in the world, the avalanche of data will be analyzed, shared, re-purposed and combined in innovative new ways designed to reveal the secrets of the fundamental properties of matter.

LHC source: http://public.web.cern.ch/public/en/LHC/LHC-en.html

Page 24: UC Berkeley iSchool Nov, 2009

“The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los Alamos Science and Technology Magazine August 2008 http://www.lanl.gov/news/1663/images/aug08/22lg.jpg

Page 25: UC Berkeley iSchool Nov, 2009

“The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los Alamos Science and Technology Magazine August 2008 http://www.lanl.gov/news/1663/images/aug08/22lg.jpg

Page 26: UC Berkeley iSchool Nov, 2009

The (US) NCAR Research Data Archive (RDA)

“The NCAR Research Data Archive (RDA) is a comparatively small (currently 246 TB, less than 5% of the MSS [Mass Storage System] total size), but very important, part of the MSS stored data. The RDA has been curated by the staff in the Computational and Information Systems Laboratory for over 40 years, [emphasis added] and as such contains reference datasets used by large numbers of scientists. The RDA contents are long-term atmospheric (surface and upper air) and oceanographic observations, grid analyses of observational datasets, operational weather prediction model output, reanalyses, satellite derived datasets, and ancillary datasets, such as topography/bathymetry, vegetation, and land use. The RDA is not a static collection; it is now over 580 datasets with about 100 routinely updated and 10-20 new ones added each year. “

C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008, page 5.

www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]

Page 27: UC Berkeley iSchool Nov, 2009

NCAR Research Data Archive (RDA)

C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008 , page 7.

www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]

Page 28: UC Berkeley iSchool Nov, 2009

http://www.ncdc.noaa.gov/img/climate/globalwarming/ar4-fig-3-9.gif

Page 29: UC Berkeley iSchool Nov, 2009

Facebook?

Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos. (A petabyte is about 1,000 times as large as a terabyte, and could store about 500 billion pages of text.)

“Training to Climb an Everest of Digital Data,” by Ashlee Vance, NYT. Published: October 11, 2009

http://www.nytimes.com/2009/10/12/technology/12data.html?_r=1

Page 30: UC Berkeley iSchool Nov, 2009

“Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.” http://en.wikipedia.org/wiki/Cavendish_experiment

Page 31: UC Berkeley iSchool Nov, 2009

http://www.newscientist.com/articleimages/mg12016390.100/0-four-fundamental-forces.html

Page 32: UC Berkeley iSchool Nov, 2009

“Experiments to determine the density of the earth,” by Henry Cavendish, ESQ., F.R.S. AND A.S. Read June 21, 1798 (From the Philosophical Transactions of the Royal Society of London for the year 1798, Part II, pp. 469-526)

From: http://www.archive.org/details/lawsofgravitatio00mackrich

Page 33: UC Berkeley iSchool Nov, 2009

DATASETS: some examples with “native metadata”

2-d_soil_temps.csv
surface, and sub-surface soil temperatures (at 2cm and 8cm depths) measured at one location for a few days in order to calibrate a model of temperature propagation. Surface temperature was measured with an infrared thermometer, subsurface temperatures with a thermocouple.

----------------------------
5-minute_light_data_for_4_continuous_days_plus_reference.xls
PPF (photosynthetic photon flux = photosynthetically active radiation 400-700nm) measured with an array of photodiodes calibrated to a Licor sensor, along a linear transect for a few days. used to get an idea of how much light plants along the transect are receiving.

----------------------------
CO2_of_air_at_different_heights_July_9.xls
concentration of CO2 in the air during the evening for one day, measured with a Licor infrared gas analyzer and a series of relays and tubes with a pump. used to examine the gradient of CO2 coming from the soil when the air is still during the evening.

----------------------------
Fern_light_response.xls
Light response curves for bracken ferns, measured with a Licor photosynthesis system. Fronds are exposed to different light levels and their instantaneous photosynthesis and conductance is measured. used in conjunction with the induction data (below) for physiological characterization of the ferns.

----------------------------
La_Selva_species_photosyntheis_table.xls
incomplete data set on instantaneous photosynthesis rates for various tropical understory and epiphytic species grown in a shade house in Costa Rica.

----------------------------
manzanita_sapflow_12-5-07_to_7-7-08.xls
instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root grown and CO2 production.

----------------------------
moisture_release_curves.xls
percentage of water content, water potential (in MegaPascals) and temperature of soil samples, measured in the laboratory for calibration of water content with water potential. soil is from the James Reserve in California.

----------------------------
Photosynthetic_induction.xls
a time-course of photosynthetic induction for a leaf over 35 minutes. instantaneous photosynthesis measured as mol CO2 m/2/s and light level is probably 1000 micromoles. used to determine physiological characteristics of bracken ferns.

----------------------------
run_2_24-h_data_for_mesh.xls
measurements of micrometeorological parameters on a moving shuttle, going from a clearing across a forest edge and into the forest for about 30 meters. Pyronometers facing up and down, pyrgeometer facing up and down, PAR, air temperature, relative humidity. Also data from a station fixed in the clearing and some derived variables calculated. used for examining edge effects in forests.

----------------------------
Segment_of_wallflower_compare_colorspaces_blur.xls
pixel counts from images of wallflowers that were segmented into flower/not-flower under different color spaces. segmentation was made using a probability matrix of hand-segmented images. used to automatically count flowers in images collected after this training data was collected (and used to determine the best color space for this task).

Page 34: UC Berkeley iSchool Nov, 2009

sbid,battery,datetime,heater_voltage,Manz1Sap1,Manz1Sap2,Manz1Sap3,Manz1Sap4,Manz2Sap5,Manz2Sap6,Manz2Sap7,Manz3Sap10,Manz3Sap8,Manz3Sap9,Manz4Sap11,timestamp,Datagap,Julian
2,12.365,1196796112,2018.8,0.5585,0.51029,0.55517,0.54354,0.6067,0.52858,0.55351,0.59008,0.59506,0.60337,0.56514,12/4/07 11:21,,4.47351
3,12.348,1196796232,2017.9,0.55682,0.51028,0.5535,0.54352,0.60669,0.52857,0.55017,0.59007,0.59505,0.60336,0.56513,12/4/07 11:23,0,4.47490
4,12.357,1196796352,2018.6,0.55514,0.51027,0.55348,0.54351,0.60501,0.52855,0.55016,0.59005,0.59504,0.60501,0.56512,12/4/07 11:25,0,4.47628
5,12.354,1196796472,2017.6,0.55514,0.51026,0.55181,0.5435,0.60334,0.52855,0.54849,0.59004,0.59503,0.60334,0.56511,12/4/07 11:27,0,4.47767
6,12.334,1196796592,2018.3,0.55347,0.51026,0.55015,0.5435,0.60333,0.52854,0.54682,0.59004,0.59502,0.605,0.56511,12/4/07 11:29,0,4.47906
7,12.34,1196796712,2018.5,0.55014,0.50859,0.55014,0.54349,0.60332,0.53019,0.54349,0.59003,0.59501,0.60498,0.56676,12/4/07 11:31,0,4.48045
8,12.337,1196796832,2017.8,0.55013,0.50692,0.55013,0.54348,0.60332,0.53019,0.54182,0.59002,0.59501,0.60498,0.56675,12/4/07 11:33,0,4.48184
9,12.328,1196796952,2017.5,0.5468,0.50691,0.5468,0.54347,0.60331,0.53018,0.53849,0.59001,0.595,0.60497,0.56674,12/4/07 11:35,0,4.48323
10,12.323,1196797072,2017,0.54679,0.50524,0.54679,0.54347,0.59998,0.53017,0.53682,0.59,0.59499,0.60496,0.56674,12/4/07 11:37,0,4.48462
11,12.328,1196797192,2018.9,0.54679,0.50191,0.54512,0.5418,0.59665,0.53017,0.53349,0.59,0.59498,0.60496,0.56673,12/4/07 11:39,0,4.48601
12,12.319,1196797312,2017.7,0.54345,0.49857,0.54178,0.54178,0.59663,0.53015,0.53015,0.58998,0.5933,0.60327,0.56671,12/4/07 11:41,0,4.48740
13,12.311,1196797432,2017.3,0.54343,0.4969,0.54011,0.54177,0.59661,0.53014,0.52848,0.58997,0.59329,0.6016,0.5667,12/4/07 11:43,0,4.48878
14,12.316,1196797552,2018.6,0.5401,0.49357,0.53678,0.54176,0.59328,0.53013,0.5268,0.58995,0.59328,0.60325,0.56669,12/4/07 11:45,0,4.49017
15,12.31,1196797672,2016.8,0.53844,0.4919,0.53511,0.54176,0.59494,0.53013,0.52514,0.58995,0.59328,0.60325,0.56503,12/4/07 11:47,0,4.49156
16,12.31,1196797792,2017.1,0.53676,0.48856,0.53343,0.54174,0.59326,0.53011,0.5218,0.58993,0.59326,0.60323,0.56501,12/4/07 11:49,0,4.49295
17,12.31,1196797912,2017.1,0.53342,0.48523,0.5301,0.54173,0.59324,0.5301,0.51846,0.58826,0.59324,0.60321,0.56499,12/4/07 11:51,0,4.49434
18,12.301,1196798031,2017.5,0.53174,0.48521,0.52842,0.53839,0.59156,0.53008,0.51845,0.58824,0.59323,0.6032,0.56498,12/4/07 11:53,0,4.49573
19,12.301,1196798151,2016.3,0.53007,0.48188,0.52509,0.53838,0.59155,0.53007,0.51512,0.58823,0.59321,0.60152,0.5633,12/4/07 11:55,0,4.49712
20,12.303,1196798271,2016.6,0.5284,0.47855,0.52175,0.53837,0.59154,0.5284,0.5151,0.58821,0.59154,0.60151,0.56163,12/4/07 11:57,0,4.49851

manzanita_sapflow_12-5-07_to_7-7-08.xls
instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root grown and CO2 production.

Datum: “0.59998”
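A single datum such as “0.59998” carries little on its own; what makes it usable is the row and column it sits in. The sketch below shows one way to recover that context, assuming the column names listed on the slide line up with the values in the order shown (the slide gives names and values separately, so this alignment is an assumption). The helper `find_datum` is hypothetical, and only two of the slide's rows are embedded to keep the example short.

```python
import csv
import io

# Two rows from the sap flow slide, with the slide's column names as the header.
# Assumption: names and values correspond in the order shown on the slide.
SAMPLE = """sbid,battery,datetime,heater_voltage,Manz1Sap1,Manz1Sap2,Manz1Sap3,Manz1Sap4,Manz2Sap5,Manz2Sap6,Manz2Sap7,Manz3Sap10,Manz3Sap8,Manz3Sap9,Manz4Sap11,timestamp,Datagap,Julian
9,12.328,1196796952,2017.5,0.5468,0.50691,0.5468,0.54347,0.60331,0.53018,0.53849,0.59001,0.595,0.60497,0.56674,12/4/07 11:35,0,4.48323
10,12.323,1196797072,2017,0.54679,0.50524,0.54679,0.54347,0.59998,0.53017,0.53682,0.59,0.59499,0.60496,0.56674,12/4/07 11:37,0,4.48462
"""

def find_datum(csv_text, value):
    """Return (sbid, column name, timestamp) for every cell equal to value."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, cell in row.items():
            if cell == value:  # compare as text so no digits are lost to rounding
                hits.append((row["sbid"], column, row["timestamp"]))
    return hits

for sbid, column, timestamp in find_datum(SAMPLE, "0.59998"):
    print(f"record {sbid}, column {column}, logged {timestamp}")
```

Under that alignment the datum belongs to record 10, column Manz2Sap5, logged at 12/4/07 11:37. Even then, the units and the meaning of the column name still have to come from the dataset description.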

Page 35: UC Berkeley iSchool Nov, 2009

“Jim Gray on eScience: A Transformed Scientific Method,” T. Hey, S. Tansley, and K. Tolle (eds), Microsoft Research. Based on the transcript of a talk given by Jim Gray to the NRC-CSTB in Mountain View, CA, on January 11, 2007. http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf

Page 36: UC Berkeley iSchool Nov, 2009

“Reanalyses” [or Meta-Analyses ]

“Atmospheric reanalyses are a main feature within the RDA and were intended to be, and have become, a very valuable data resource for a wide variety of climate and weather studies. By combining many types of atmospheric observations with advanced data assimilation and forecast models a “best possible” 3D estimate of the atmospheric state over extended time periods is achieved.

“Reanalyses are supported by many historical data sources that have been curated over time. As an illustration the major sources of atmospheric profile data include wind only soundings beginning in 1920 (Figure 2). These are augmented with soundings of temperature, humidity, and wind beginning in 1948. “

C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing ,” from the 4th International Digital Curation Conference December 2008, page 6.

www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]

Page 37: UC Berkeley iSchool Nov, 2009

Fundamental Questions:
• Data Specification – scientific logic of data definition
• Data Creation – specification of methodology
• Data Integrity – preservation -- “chain of custody”

“Chain of custody refers to the chronological documentation or paper trail, showing the seizure, custody, control, transfer, analysis, and disposition of evidence, physical or electronic.”

[http://en.wikipedia.org/wiki/Chain_of_custody, clipped 11/12/09 10:30pm PST]

• Data transformations
– Logic
– Competence / Technical Performance / Execution
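For electronic data, one common way to make a chain of custody tamper-evident is a hash-chained log, in which each entry records a digest of the data and of the previous entry. The sketch below is an illustration under stated assumptions and is not taken from the talk: the function names (`append_event`, `verify`), the entry fields, and the sample actors are all hypothetical, and a real system would also need timestamps, signatures, and secure storage.

```python
import hashlib
import json

def _digest(record):
    """SHA-256 over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def append_event(log, actor, action, data):
    """Append a custody event that is linked to the previous entry by its hash."""
    body = {
        "actor": actor,
        "action": action,
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "prev_hash": log[-1]["entry_hash"] if log else "",
    }
    entry = dict(body, entry_hash=_digest(body))
    log.append(entry)
    return entry

def verify(log):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = ""
    for entry in log:
        body = {key: value for key, value in entry.items() if key != "entry_hash"}
        if body["prev_hash"] != prev_hash or _digest(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True

# Hypothetical custody trail for one raw data file.
raw = b"sbid,battery,datetime\n2,12.365,1196796112\n"
log = []
append_event(log, "field datalogger", "collected", raw)
append_event(log, "analyst", "transferred to archive", raw)
chain_ok = verify(log)
print(chain_ok)
```

The design choice here is that integrity checking needs no trusted party at read time: anyone holding the log can recompute the digests, and changing any earlier entry invalidates every later one.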

Page 38: UC Berkeley iSchool Nov, 2009

“Keeping Raw Data in Context”

“…any initiative to share raw clinical research data must also pay close attention to sharing clear and complete information about the design of the original studies. Relying on journal articles for study design information is problematic, for three reasons. First, journal articles often provide insufficient detail when describing key study design features such as randomization (1) and intervention details (2). Second, some data sets may come from studies with no publications [only 21% of oncology trials registered in ClinicalTrials.gov before 2004 and completed by September 2007 were published (3)]. Finally, investigators cannot reliably search journal articles for methodological concepts like “double blinding” or “interrupted time series,” crucial concepts for proper interpretation of the data. A mishmash of non-standardized databases of raw results and unevenly reported study designs is not a strong foundation for clinical research data sharing.”

“We believe that the effective sharing of clinical research data requires the establishment of an interoperable federated database system that includes both study design and results data. A key component of this system is a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.”

I Sim, et al. “Keeping Raw Data in Context”[letter] Science v 323 6 Feb 2009, p713.

Page 39: UC Berkeley iSchool Nov, 2009

“Increasing levels of coordinate digit noise associated with repeated projection transformations”

Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.

Page 40: UC Berkeley iSchool Nov, 2009

“A careful look at the coordinate digits stored as double precision variables in a GIS yields a variety of interesting patterns that are a result of previous machine error, rounding error, measurement error, and so forth. Any slight cartographic alteration (rotation/skewing, clipping/sub-setting, reprojecting, etc.) can add noise into the coordinate and can be used to characterize a vector dataset.”

“It is well known that cartographic coordinates stored in double precision are far more precisely specified than is merited by their accuracy, even for highly-accurate global datasets. Far more coordinate digit places are stored for the sake of avoiding machine error than are needed to define the location of map objects within the necessary tolerances for both absolute and relative accuracies.”

Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.
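The gap between stored precision and actual accuracy can be seen by sending a coordinate through a projection and back many times. The sketch below is an illustration only and does not reproduce the method of Rice, Goodchild and Clarke: it assumes a simple spherical Mercator projection, a hypothetical starting point near Las Vegas, and an arbitrary count of 1000 round trips.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius assumed for this spherical Mercator sketch

def forward(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def inverse(x, y):
    """Unproject Mercator x/y in meters back to longitude/latitude in degrees."""
    lon_deg = math.degrees(x / EARTH_RADIUS_M)
    lat_deg = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon_deg, lat_deg

def round_trip(lon_deg, lat_deg, times):
    """Apply forward followed by inverse the given number of times."""
    for _ in range(times):
        lon_deg, lat_deg = inverse(*forward(lon_deg, lat_deg))
    return lon_deg, lat_deg

start = (-115.1398, 36.1699)  # hypothetical point, four decimal places as surveyed
end = round_trip(*start, times=1000)
drift = (abs(end[0] - start[0]), abs(end[1] - start[1]))

print("stored after round trips:", repr(end[0]), repr(end[1]))
print("drift in degrees:", drift)
```

Printing the values with `repr` shows every digit the double holds. Any change introduced by the round trips stays far below one millionth of a degree, that is, in digit places well beyond what the measurement accuracy supports.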

Page 41: UC Berkeley iSchool Nov, 2009

Individual Libraries

Cooperative Projects

National Disciplinary Initiatives

“BIG Science”

“Small Science”

Local / Personal Archiving

International Collaborative Research Effort

Individuals

Data Centers

GRIDS

Page 42: UC Berkeley iSchool Nov, 2009

“Small Science”?

Page 43: UC Berkeley iSchool Nov, 2009

The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight states of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus disaggregated components of an incipient network that is only as effective as the individual transactions that put it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable.

The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain Office of International Scientific and Technical Information Programs Board on International Scientific Organizations Policy and Global Affairs Division, National Research Council of the National Academies, p. 8

Page 44: UC Berkeley iSchool Nov, 2009

http://www.loc.gov/exhibits/dres/dre123.jpg

Maria Sibylla Merian Metamorphosis insectorum Surinamensium (Metamorphosis of the Insects of Surinam) Amsterdam, 1705, figure 46 Hand-colored engraving (123)

Page 45: UC Berkeley iSchool Nov, 2009

http://www.nyu.edu/projects/materialworld/images/1_Darwin%20Tree%20B%2036.jpg

http://darwin-online.org.uk/converted/published/1975_NaturalSelection_F1583/1975_NaturalSelection_F1583_fig03.jpg

DARWIN

Page 46: UC Berkeley iSchool Nov, 2009

http://diglib1.amnh.org/cgi-bin/database/index.cgi

FIELD NOTES FROM THE AMERICAN MUSEUM CONGO EXPEDITION 1909-1915

Page 47: UC Berkeley iSchool Nov, 2009

http://diglib1.amnh.org/galleries/bats/taphozous_mauritianus.html

Page 48: UC Berkeley iSchool Nov, 2009
Page 49: UC Berkeley iSchool Nov, 2009

Rheinardia ocellata, the Crested Argus. Photographed at night by an automatic camera-trap in the Ngoc Linh foothills (Quang Nam Province).

Courtesy AMNH Center for Biodiversity and Conservation

Page 50: UC Berkeley iSchool Nov, 2009
Page 51: UC Berkeley iSchool Nov, 2009

http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14

By Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT, Feb 2, 2009. SEE:

http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse

Page 52: UC Berkeley iSchool Nov, 2009

How many data sources contributed to this analysis?

Page 53: UC Berkeley iSchool Nov, 2009

The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight states of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus disaggregated components of an incipient network that is only as effective as the individual transactions that put it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable.

The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain; Office of International Scientific and Technical Information Programs; Board on International Scientific Organizations; Policy and Global Affairs Division; National Research Council of the National Academies, p. 8

Page 54: UC Berkeley iSchool Nov, 2009

Individual Libraries

Cooperative Projects

National Disciplinary Initiatives

“BIG Science” / “Small Science”

Local / Personal Archiving

International Collaborative Research Effort

Individuals

Data Centers

GRIDS

Page 55: UC Berkeley iSchool Nov, 2009
Page 56: UC Berkeley iSchool Nov, 2009

Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing. doi: 10.1787/603233448430 http://dx.doi.org/10.1787/603233448430

http://ocde.p4.siteinternet.com/publications/doifiles/publishing-standards-data-2009.pdf

Page 57: UC Berkeley iSchool Nov, 2009

Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing. doi: 10.1787/603233448430 http://dx.doi.org/10.1787/603233448430

http://ocde.p4.siteinternet.com/publications/doifiles/publishing-standards-data-2009.pdf

Page 58: UC Berkeley iSchool Nov, 2009

What does “Full Life-Cycle” Data Management Mean?

Page 59: UC Berkeley iSchool Nov, 2009

US NSF “DataNet” Program: “the full data preservation and access lifecycle”

• “acquisition”
• “documentation”
• “protection”
• “access”
• “analysis and dissemination”
• “migration”
• “disposition”

“Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation,” NSF 07-601, US National Science Foundation, Office of Cyberinfrastructure, Directorate for Computer & Information Science & Engineering

Page 60: UC Berkeley iSchool Nov, 2009

Incentives?

Page 61: UC Berkeley iSchool Nov, 2009

How do we Incentivize Change?

• Individuals
• Professions / Disciplines
• Organizations
• Institutions (Universities, Research Institutes, Museums, Gardens, Herbaria, Aquariums, Zoos)
• “Memory Institutions” (Libraries, Archives)
• Governments
• Funders / Sponsors
• Publishers!

Page 62: UC Berkeley iSchool Nov, 2009

Individual’s willingness to share: the Core functions of Scholarly Communication

• “Registration, which allows claims of precedence for a scholarly finding.

• “Certification, which establishes the validity of a registered scholarly claim.

• “Awareness, which allows participants in the scholarly system to remain aware of new claims and findings.

• “Archiving, which preserves the scholarly record over time.

• “Rewarding, which rewards participants for their performance in the communication system based on metrics derived from that system.”

Roosendaal, H., Geurts, P., in Cooperative Research Information Systems in Physics (Oldenburg, Germany, 1997).

Page 63: UC Berkeley iSchool Nov, 2009

Professional / Disciplinary Incentives?

Page 64: UC Berkeley iSchool Nov, 2009

• Norms and standards for sharing vary by discipline
• In “big science” (astrophysics / astronomy / meteorology / oceanography / genomics) sharing is expected (if not required) and contributions to a common fund of knowledge are assumed (See also: GENBANK)
– Standards are relatively clear
– Mechanisms for sharing are well-developed
– Collective / collaborative authorship is commonplace
• In “small science” such norms are weaker

Page 65: UC Berkeley iSchool Nov, 2009

Small Science: Data Deposit and Access

• Data are typically held in many formats

• Discovery of data is very weakly supported by standards-development

• Access to and use of data are highly variable

• (However, progress has been made respecting museum specimen data in the past 20 years [SEE, for example: GBIF and many allied projects])

• Some progress has been made respecting observational and other data

• Ecological and conservation field data remain highly problematic

Page 66: UC Berkeley iSchool Nov, 2009

Some suggestions for action include:

• government agencies and private foundations must both set strict requirements for effective sharing – with serious penalties (such as disqualification for future research funding) for failures to share;

• peer review processes must include rigorous scrutiny of past histories of sharing and must require state-of-the-art planning for sharing (not simply a promise to “put data up on the Web”);

• negotiations for “overhead” (“indirect costs”) compensation from funders must include examination of digital infrastructure adequate for sharing and maintenance of data;

• accreditation bodies for educational institutions and museums must start to require demonstrated evidence of capacity to support digital access and maintenance of data;

• professional societies and professional disciplines must begin to require evidence of effective sharing of data in evaluating credentials for hiring, promotion and tenure;

Page 67: UC Berkeley iSchool Nov, 2009
Page 68: UC Berkeley iSchool Nov, 2009
Page 69: UC Berkeley iSchool Nov, 2009
Page 70: UC Berkeley iSchool Nov, 2009

http://www.mikero.com/blog/2009/02/20/more-darwin

http://www.zazzle.com/darwin2009

Page 71: UC Berkeley iSchool Nov, 2009

From: Tom Moritz [mailto:[email protected]]
Sent: Thursday, November 12, 2009 1:46 AM
To: Donat Agosti
Subject: Snapple Real Fact #134: "An ant can lift 50 times its own weight."

Is this true?
Tom
________________________________________________
From: Donat Agosti <[email protected]>
Date: Wed, Nov 11, 2009 at 8:03 PM
Subject: RE: Snapple Real Fact #134: "An ant can lift 50 times its own weight."
To: Tom Moritz [email protected]

People says so [emphasis added] – but we once looked for the evidence, but could not find a scientific paper confirming this.

D

Page 72: UC Berkeley iSchool Nov, 2009

"They [the hippopotami] present the following appearance; four-footed, with cloven hooves like cattle; blunt-nosed; with a horse's mane, visible tusks, a horse's tail and voice; big as the biggest bull. Their hide is so thick that, when it is dried, spearshafts are made of it.” Herodotus, The Histories (with an English translation by A. D. Godley). Cambridge. Harvard University Press. 1920. LXXI

http://old.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus%3Aabo%3Atlg%2C0016%2C001&query=2%3A71%3A1 [clipped 11/12/09]

Iobi Ludolfi aliàs Leut-holf dicti Historia Æthiopica, sive Brevis & succincta descriptio regni Habessinorum, quod vulgò malè Presbyteri Iohannis vocatur : 2009 Cambridge University Library

Page 73: UC Berkeley iSchool Nov, 2009

“…the great trouble with the world was that which survived was held in hard evidence as to past events. A false authority clung to what persisted, as if those artifacts of the past which had endured had done so by some act of their own will.”

-- Cormac McCarthy The Crossing

a problem with “evidence”…

Page 74: UC Berkeley iSchool Nov, 2009

“Πάντα ῥεῖ καὶ οὐδὲν μένει” Heraclitus: “Everything flows, nothing stands still.”

All data is dynamic

Page 75: UC Berkeley iSchool Nov, 2009

From examination of elephants’ skulls the early Greeks deduced that a species of humanoid Cyclops existed…

(SEE -- for example -- The Odyssey and Ulysses’ encounter with Polyphemus on the island of Sicily…)

http://www.amnh.org/exhibitions/mythiccreatures/land/greek.php

Page 76: UC Berkeley iSchool Nov, 2009

Another deduction from the evidence of narwhal tusks…

“In the Middle Ages, narwhal tusks were widely thought to be unicorn horns with magical, curative properties. Indeed, cups made from narwhal tusks (above) were thought to neutralize poisons and were highly valued.”

http://www.amnh.org/exhibitions/mythiccreatures/land/unicorns.php

Page 77: UC Berkeley iSchool Nov, 2009

Kirtland’s Warbler / Abaco Island, The Bahamas

Page 78: UC Berkeley iSchool Nov, 2009

and 5 CALIFORNIA CONDORS !!!

DEAD HARBOR SEAL

“NATIVE” METADATA