Hamish James, Statistics New Zealand: Open data and data curation
TRANSCRIPT
Defining data
Data consists of sets of structured values that can be organised, analysed and manipulated by a software application or some other means of calculation. This includes data collected directly through surveys and administrative systems, as well as data created or compiled by aggregating or reanalysing other sources. A defining characteristic of data is that it is machine-readable.
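To make "machine-readable" concrete, here is a minimal Python sketch of structured values being organised and aggregated by software. The survey extract, its column names and the function `total_by_region` are all invented for illustration; they do not come from any Statistics New Zealand dataset.

```python
import csv
import io

# Hypothetical survey extract, invented for illustration only.
RAW = """region,year,respondents
North,2009,120
South,2009,80
North,2010,150
"""


def total_by_region(text):
    """Sum the respondents column for each region in a CSV string."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        region = row["region"]
        totals[region] = totals.get(region, 0) + int(row["respondents"])
    return totals


totals = total_by_region(RAW)
print(totals)
```

Because the values are structured, the same few lines work whether the extract has three rows or three million; a scanned table or a chart image would not allow this.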
Open data, data curation
Open data is a philosophy based on the idea that data is more valuable if more people can use it, and that technology has made the cost of sharing data negligible.
Data curation is a field of research and work focusing on the long-term management of data, built on the argument that the opportunity cost of losing data is high.
Open data highlights benefits
Data curation worries about costs
Focus of open data activities
• Data collected and held by governments
• Data collected or generated through publicly funded research
• http://wiki.opengovdata.org/index.php?title=OpenDataPrinciples
Reasons to make data open
• The underlying purposes of making publicly funded data more accessible are to:
• inform decision making by government, businesses and communities
• increase transparency and accountability in government decision making
• assist informed participation by the public in government decision making
• promote economic development through the innovative application of data collected for one purpose to other tasks
• gain greater value from research data
Barriers to reuse of government data
• Agency culture (reluctance or hostility to data sharing)
• Funding constraints
• Ensuring data confidentiality
• Shared ownership
• Poor dissemination practices
Open Government Data Principles
• Government data shall be considered open if it is made public in a way that complies with the principles below:
1. Complete. All public data is made available. Public data is data that is not subject to valid privacy, security or privilege limitations.
2. Primary. Data is as collected at the source, with the highest possible level of granularity, not in aggregate or modified forms.
3. Timely. Data is made available as quickly as necessary to preserve the value of the data.
4. Accessible. Data is available to the widest range of users for the widest range of purposes.
5. Machine processable. Data is reasonably structured to allow automated processing.
6. Non-discriminatory. Data is available to anyone, with no requirement of registration.
7. Non-proprietary. Data is available in a format over which no entity has exclusive control.
8. License-free. Data is not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.
Characteristics of open data
Open data:
• Free and open access to the data
• Freedom to redistribute the data
• Freedom to reuse the data
• No restriction of the above based on who someone is (e.g. their nationality) or their field of endeavour (e.g. commercial or non-commercial)
cf. http://www.okfn.org/about/
Creative Commons
Creative Commons licence conditions:
• Attribution
• Share-alike
• No derivative works
• Non-commercial
Linked data
• Linked data uses semantic web approaches (especially RDF) to describe data and make it accessible to machines – a web of linked data
• RDF ‘triples’ are used to describe things
• Subject – predicate – object
• Hamish – is a – presenter
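The subject, predicate, object pattern can be sketched in plain Python without any RDF library. This is only an illustration of the shape of a triple, not real RDF: actual linked data identifies things with URIs, and the `Triple` type, the `match` helper and the second and third triples below are invented for the example.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


triples = [
    Triple("Hamish", "is a", "presenter"),            # from the slide
    Triple("presenter", "is a kind of", "person"),    # invented for illustration
    Triple("Hamish", "presents", "open data talk"),   # invented for illustration
]


def match(store: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


about_hamish = match(triples, subject="Hamish")
print(about_hamish)
```

Querying by pattern, with any position left open, is the basic operation that lets machines follow links from one statement to the next.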
Examples
“Which town or city in the UK has the highest proportion of students?”
“Which town or city in the UK is home to one or more university campuses whose registered full or part time (non-distance) students divided by the local population gives the largest percentage?”
http://digitalcuration.blogspot.com/2010/03/linked-data-and-reality.html
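The gap between the two questions shows up as soon as the second one is written as a calculation. The sketch below is one possible reading of the longer question; the `Town` fields, the helper names and every figure are invented for illustration and are not real UK statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Town:
    name: str
    population: int        # local population
    campuses: int          # number of university campuses in the town
    campus_students: int   # registered full or part time, non-distance students


def student_share(town: Town) -> float:
    """Students divided by local population, as a percentage."""
    return 100.0 * town.campus_students / town.population


def highest_student_share(towns: List[Town]) -> Town:
    """Among towns with at least one campus, pick the largest share."""
    candidates = [t for t in towns if t.campuses >= 1]
    return max(candidates, key=student_share)


# Invented figures, for illustration only.
towns = [
    Town("Town A", population=100_000, campuses=2, campus_students=20_000),
    Town("Town B", population=40_000, campuses=1, campus_students=10_000),
    Town("Town C", population=500_000, campuses=3, campus_students=60_000),
    Town("Town D", population=30_000, campuses=0, campus_students=0),
]

best = highest_student_share(towns)
print(best.name, student_share(best))
```

Each field name hides a decision (which students count, which population boundary applies) that the short form of the question leaves unstated.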
[Diagram labels: render, explain, re/use]
Documentation:
• Standards
• Meaning
• Interpretation
Technology:
• Hardware
• Formats
• Software
What is missing? Context
• Data is not self-describing
• Who provides the description?
• What does it cost to provide the description?
• How much of the description is held as tacit knowledge?
• Expert’s personal knowledge
• Rules and meaning encoded into the data and software
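A small Python sketch of why data is not self-describing: the raw record below is unreadable until a description is supplied alongside it. The field names, codes, codebook layout and the `describe` helper are all hypothetical, chosen only to illustrate the point.

```python
# A raw record as it might sit in a file: terse field names and coded values.
# All names and codes here are hypothetical.
RAW_ROW = {"reg": "3", "sx": "1", "inc": "52"}

# The description someone has to write down (and pay for).
CODEBOOK = {
    "reg": {"label": "Region", "codes": {"3": "Region C"}},
    "sx": {"label": "Sex", "codes": {"1": "Female"}},
    "inc": {"label": "Annual income", "unit": "thousand dollars"},
}


def describe(row, codebook):
    """Translate a coded record into labelled values using the codebook."""
    described = {}
    for field, value in row.items():
        entry = codebook.get(field)
        if entry is None:
            # Undocumented field: the meaning stays locked in someone's head.
            described[field] = value
            continue
        meaning = entry.get("codes", {}).get(value, value)
        unit = entry.get("unit")
        described[entry["label"]] = f"{meaning} {unit}" if unit else meaning
    return described


described = describe(RAW_ROW, CODEBOOK)
print(described)
```

Without the codebook, the knowledge that "3" means a particular region lives only in an expert's memory or in the software that originally produced the file.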
Data curation
• Data curation involves:
• Data management
• Adding value to data
• Data sharing for re-use
• Data preservation for later re-use
http://www.dcc.ac.uk/news/what-makes-data-curation
[Figure legend: open data, data curation]
Open data brings benefits and risks
open data
• more users
• highlights data curation failures
• justifies data curation costs
• pressure for more user support
• expands expert community
• increases risk of poor analysis
Complementary ideas
• Actively curated data will:
• Remain technologically accessible
• Be easier to understand (and therefore use)
• Data curation will benefit from data being made more open:
• Data that is in active use tends to remain usable
• Widely used data is better understood than isolated data