making agricultural knowledge globally discoverable: are we there yet?
DESCRIPTION
Slides of talk at TGI tutorial series at IFPRI, Washington DC, July 11, 2014.
TRANSCRIPT
making agricultural knowledge globally
discoverable (and hopefully usable)
Nikos Manouselis
CEO, Agro-Know
www.agroknow.gr
background
An extraordinary company that captures, organizes and adds value to the rich information available in agricultural and biodiversity sciences, in order to make it universally accessible, useful and meaningful.
http://www.agroknow.gr
Our way of doing things
We put our people at our focus
We have a culture of shared, co-defined values
We are based on trust and transparency
We see beyond profit by serving our users and customers so that they create societal impact
We develop and put into practice solutions that transform data into meaningful knowledge and services
We help people solve problems informed by data
Unorganized Content in local and remote sites
Widgets
Authoring services
Data Discovery Services
Analytics services
Data Platform
Ingestion Translation Publication
Harvesting Blossom Cultivation
Organized and structured Content in local and remote DBs
Educational
Bibliographic
Other
Enrichment
Aggregate data from diverse sources
Works with different types of data
Prepare data for meaningful services
Educational
Bibliographic
data aggregation & sharing solutions
working with high profile partners & clients
• Food and Agriculture Organization (FAO) of the United Nations
• World Bank Group
• UK’s Dept for International Development (DFID)
• Michigan State University (MSU)
• Wageningen University & Research (WUR)
• French Institute of Agricultural Research (INRA)
• Creative Commons
large scale data-related projects
• agINFRA: a data infrastructure to support agricultural scientific communities (2011 - now)
– EU, $5.2M, 12 partners (incl. FAO); tech coordinator, evaluation, sustainability
– in G8 Open Data in Agriculture Action Plan for Europe
• SemaGrow: Data intensive techniques to boost the real-time performance of global agricultural data infrastructures (2012 - now)
– EU, $3.1M, 8 partners (incl. FAO, WUR); tech coordinator, evaluation, sustainability
– in G8 Open Data in Agriculture Action Plan for Europe
• Organic.Lingua: Demonstrating the potential of multilingual Web Portal for Sustainable Agricultural & Environmental Education (2011-2014)
– EU, $2.4M, 11 partners (incl. INRA); tech+data coordinator, evaluation
data interoperability work
• Agricultural Interoperability Interest Group (IG) at Research Data Alliance (RDA)
• Database Subgroup, Knowledge & Learning Systems Group, Global Food Safety Partnership (GFSP)
context
“Knowledge is the engine of our economy. And data is its fuel”
Neelie Kroes, Vice President of the European Commission
http://ec.europa.eu/digital-agenda/en/news/economic-and-social-benefits-big-data
“By improving our ability to extract knowledge and insights from large and complex collections of digital data, the initiative promises to help solve some the Nation’s most pressing challenges.”
Big Data Research & Development Initiative
http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
policy
• USA’s National Research Council on Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age
– “researchers to make all research data, methods, and other information underlying results publicly accessible in a timely manner”
– “the stewardship of research data is a critical long-term task for the research enterprise and its stakeholders”
http://www.nap.edu/catalog.php?record_id=12615
internationally
• joint USA, EU, Australia, Research Data Alliance (RDA) vision
– “researchers and innovators openly sharing data across technologies, disciplines, and countries to address the grand challenges of society”
https://rd-alliance.org/about.html
CIARD’s manifesto
• “towards a Knowledge Commons on Agricultural Research for Development”
• “agricultural knowledge is freely accessible and contributes to reducing hunger and poverty”
• “open knowledge makes it easier to provide better solutions”
http://www.ciard.net/about/manifesto
GODAN’s statement of purpose
• “support global efforts to make agricultural and nutritionally relevant data available, accessible, and usable for unrestricted use worldwide”
• “advocate for the release and re-usability of data in support of Innovation and Economic Growth, Improved Service Delivery and Effective Governance, and Improved Environmental and Social Outcomes”
http://godan.info/statement.html
IFPRI & open access
• “…research is an international public good, that should be freely disseminated to the extent possible…”
• “IFPRI is committed to the principle of free access to the knowledge it generates”
CGIAR & open access
• “CGIAR regards the results of its research and development activities as international public goods and is committed to their widespread dissemination and use to achieve the maximum impact to advantage the poor…”
agricultural knowledge: globally accessible?
a “good enough” case study
agricultural bibliography
• bibliography on agricultural sciences
• several efforts in putting together (aggregating/indexing) metadata records on agricultural publications & grey literature
• FAO’s AGRIS service: a prominent example
– quite advanced data ingestion workflow & infrastructure
– semantic backbone with AGROVOC as LOD & triple store with all aggregated records
– more than 7.5 million publications indexed & made discoverable
elaborated, automated workflow
Metadata harvester
Filtering component
Stores
File system (DC, IEEE LOM, MODS XML)
File system (DC, IEEE LOM, MODS XML)
Stores
Identification and de-duplication component
MySQL
Duplicates
Stores
Transformation component ( to AKIF)
Store metadata in JSON (Internal Format)
Link checking component
PostProcessing/Enrichment component
File system (XMLs)
Get unique ID
Records with
Broken Links
Indexing mechanism
API
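The component labels in this workflow (filtering, identification and de-duplication, transformation to an internal JSON format, link checking) can be read as a chain of small steps. What follows is only a minimal sketch of that idea, not the actual AGRIS or AKIF implementation: the sample records, the field names, the filter rule and the internal JSON layout are all assumptions made for illustration, and the link check is purely syntactic (no network request).

```python
import json
from urllib.parse import urlparse

# Hypothetical harvested records; a real harvester would supply these.
RAW_RECORDS = [
    {"identifier": "rec-1", "title": "Maize yield trials", "url": "http://example.org/a"},
    {"identifier": "rec-1", "title": "Maize yield trials", "url": "http://example.org/a"},
    {"identifier": "rec-2", "title": "", "url": "http://example.org/b"},
    {"identifier": "rec-3", "title": "Soil salinity survey", "url": "not a url"},
]


def passes_filter(record):
    """Filtering step (assumed rule): keep records with a non-empty title."""
    return bool(record.get("title", "").strip())


def deduplicate(records):
    """De-duplication step, keyed on the record identifier."""
    seen, unique, duplicates = set(), [], []
    for record in records:
        if record["identifier"] in seen:
            duplicates.append(record)
        else:
            seen.add(record["identifier"])
            unique.append(record)
    return unique, duplicates


def to_internal(record):
    """Transformation step: map to a made-up internal layout."""
    return {"id": record["identifier"], "title": record["title"], "location": record["url"]}


def link_looks_valid(url):
    """Link checking step, syntactic only."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def run_pipeline(records):
    filtered = [r for r in records if passes_filter(r)]
    unique, duplicates = deduplicate(filtered)
    internal = [to_internal(r) for r in unique]
    ok = [r for r in internal if link_looks_valid(r["location"])]
    broken = [r for r in internal if not link_looks_valid(r["location"])]
    # Records that survive every step are serialized as JSON for indexing.
    return [json.dumps(r, sort_keys=True) for r in ok], duplicates, broken


indexable, duplicates, broken = run_pipeline(RAW_RECORDS)
```

Keeping duplicates and broken-link records in separate outputs, rather than silently dropping them, mirrors the separate stores shown for them in the diagram.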
AGRIS search service
results mashing up more info
similar/relevant efforts
• PubAg: forthcoming service by National Agricultural Library (NAL) for discovering USDA publications – and beyond
• LGU community of ag knowledge: forthcoming service federating institutional repositories of Land Grant Universities
• CGIAR open: (to be) federating & providing access to all CG center repositories
• …and more to come
but we are not there yet
a) each initiative replicating technical & data processing effort (harvesting, transforming, indexing…)
b) coverage is not complete – transferring the discovery problem to the level of aggregators
c) still not focusing on the needs of each specific subject, group, region, project, …
d) agriculture is multi-disciplinary: relevant publications may be found in other domains (health, economics, environment, … )
agricultural knowledge: globally accessible?
a more demanding case study
CSPI
• the organized voice of the American public on nutrition, food safety, health and other issues
– “improve food safety laws and reduce the incidence of foodborne illness”
• has tracked foodborne illness outbreaks since 1997
– events where two or more people become ill from eating the same food
– outbreaks where both the food and pathogen can be identified
US Outbreak Alert Database (until 2011)
http://cspinet.org/foodsafety/outbreak/pathogen.php
US Outbreak Report (after 2011)
http://cspinet.org/foodsafety/outbreak_report.html
Safe Food International
http://regionalnews.safefoodinternational.org
data sources of interest
• CDC - Foodborne Outbreak Online Database (FOOD)
– http://wwwn.cdc.gov/foodborneoutbreaks/
• ProMED mail
– http://www.promedmail.org
• Kansas FS-net
– blogging at http://barfblog.com
– posting news at http://bites.ksu.edu
– archive at http://www.safefoodhandler.com/fsnet.htm
• Project TYCHO
– https://www.tycho.pitt.edu
some of the challenges
a) time-consuming & laborious primary data identification and documentation (by hand)
b) not complete coverage: incomplete & problematic data collection and sharing
c) multiple & outdated databases for secondary/processed data storage and curation
d) time-consuming & expensive processed data visualization & publication
improving curation of data
• focus on making data documentation, storage, management easier
a) migrate existing multiple databases into a single data repository
b) improve data organization & classification schemes (e.g. by pathogen, food, geographical location, time reported, etc.)
c) improve data curation & filtering workflows (document & store data once, feed multiple sites/access points; US vs. international sites)
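To make "document & store data once, feed multiple sites" concrete, here is a small sketch of a single repository whose records carry the classification fields named above (pathogen, food, location, time reported). The class names, field names and sample values are hypothetical and are not taken from any CSPI system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OutbreakRecord:
    """One documented outbreak, classified along the dimensions listed above."""
    pathogen: str
    food: str
    country: str
    reported: date
    cases: int


class OutbreakRepository:
    """Single store; each site or access point is just a filtered view of it."""

    def __init__(self):
        self._records = []

    def add(self, record):
        self._records.append(record)

    def query(self, **criteria):
        # Keep records whose fields equal every given criterion.
        return [
            r for r in self._records
            if all(getattr(r, field) == value for field, value in criteria.items())
        ]


repo = OutbreakRepository()
# Made-up sample records, for illustration only.
repo.add(OutbreakRecord("Salmonella", "eggs", "US", date(2010, 8, 13), 120))
repo.add(OutbreakRecord("Norovirus", "oysters", "FR", date(2011, 1, 5), 40))
repo.add(OutbreakRecord("Salmonella", "sprouts", "US", date(2011, 6, 2), 25))

us_site_view = repo.query(country="US")
international_site_view = [r for r in repo.query() if r.country != "US"]
salmonella_view = repo.query(pathogen="Salmonella", country="US")
```

The US and international sites then differ only in the query they run, not in where the data lives.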
modernize outbreak data repository
advanced data organisation & classification
use single data repository for all CSPI sites
improving discovery & processing
• focus on foodborne illness outbreak reports & product recalls
a) automate as much as possible workflow of reports’ processing (feeding directly into CSPI data repository)
b) extend coverage of data types (include food product recalls)
c) extend coverage of data sources (include more sites with outbreak reports & product recalls)
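Automating report processing largely means turning free text into structured fields. The sketch below shows the simplest possible form of that, a single regular expression over an assumed sentence shape. The pattern, the field names and the example sentences are invented for illustration; real outbreak reports vary far more and would need more robust extraction.

```python
import re

# Assumed sentence shape: "<N> people fell ill with <Pathogen> after eating <food> in <Place>."
PATTERN = re.compile(
    r"(?P<cases>\d+) people fell ill with (?P<pathogen>[A-Z][\w. ]*?) "
    r"after eating (?P<food>[\w ]+?) in (?P<place>[A-Z][\w ]+)"
)


def extract_outbreak(text):
    """Return a dict of structured fields, or None if the text does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["cases"] = int(fields["cases"])
    return fields


# Hypothetical report sentences.
first = extract_outbreak(
    "At least 42 people fell ill with Salmonella after eating raw sprouts in Ohio."
)
second = extract_outbreak(
    "Officials said 7 people fell ill with E. coli after eating ground beef in New Mexico."
)
unmatched = extract_outbreak("A recall was announced for frozen berries.")
```

Text that does not match returns None, so it can be routed to manual documentation instead of being stored with guessed values.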
auto extract structured data from text
include & link to food recall data
include waterborne illness data
add more (relevant) data sources
improving visualization & publication
• focus on making processed & validated data accessible immediately online
a) automate as much as possible workflows for generating filtered reports (feed diagrams & tables for CSPI publications, present directly online through CSPI & SFI web sites)
b) offer opportunities for public to interact with data online (play with parameters and generate new data reports & visualizations)
c) share data openly for research, education and awareness (through CSPI & SFI web sites)
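A filtered report that users can re-parameterize is, at its core, a filter plus a group-and-sum. The following sketch assumes a flat list of outbreak rows with made-up values; `build_report` and its parameters are hypothetical names, not part of any existing CSPI or SFI tool.

```python
from collections import defaultdict

# Made-up rows for illustration only; not CSPI data.
OUTBREAKS = [
    {"pathogen": "Salmonella", "food": "eggs", "year": 2009, "cases": 30},
    {"pathogen": "Salmonella", "food": "poultry", "year": 2010, "cases": 12},
    {"pathogen": "Norovirus", "food": "salad", "year": 2010, "cases": 45},
    {"pathogen": "Listeria", "food": "cheese", "year": 2010, "cases": 5},
]


def build_report(rows, group_by, **filters):
    """Sum cases per value of `group_by`, keeping only rows that match `filters`."""
    totals = defaultdict(int)
    for row in rows:
        if all(row[field] == value for field, value in filters.items()):
            totals[row[group_by]] += row["cases"]
    # Largest totals first; ties broken alphabetically for stable output.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


by_pathogen_2010 = build_report(OUTBREAKS, "pathogen", year=2010)
by_food_salmonella = build_report(OUTBREAKS, "food", pathogen="Salmonella")
```

The same function could feed a table in a publication and an interactive page: only the `group_by` and filter arguments change.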
enhance search/discovery of data
Landing page Search and filter page View details and access page
use of advanced data visualizations
allow users to customize data reports
provide multi-channel access to data
shaping a bigger & hairier goal…
let’s imagine that
• we have a very big, open, scalable platform that…
– …will catalog all relevant information entities
– …will make all information machine readable and discoverable
– …will allow information providers to express how, with whom, under which license and for which purposes they share this info
– …will help people utilize the collective power of information to solve more societal challenges, better
– …will make funding & resource use transparent for donors and the public
– …will coordinate, consolidate and harmonize data & technology sharing among agri-food sectors and user communities
for example: CIARD RING
catalogues (some) data
catalogues (some) solutions
catalogues (some) organisations
could federate & include: more data
could federate & include: more software
could federate & include: donors
could federate & include: funding
scale up, per federated info type
Meta-registry platform federating all existing registries & making information discoverable
Registries of data sources
Federated data registry
Federated information providers
Registries of organisations’ catalogs
Federated org registry
Registries of software apps/components
Federated solution registry
…etc
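One way to picture a meta-registry is as a merge over existing registries that keeps track of where each entry was found. This is a sketch under stated assumptions only: the registry names, the entry fields (`type`, `id`, `name`) and the `federate` function are hypothetical and do not describe the CIARD RING data model.

```python
def federate(registries):
    """Merge entries from several registries, keyed by (type, id).

    An entry listed in more than one registry appears once, with every
    source registry recorded under "listed_in".
    """
    merged = {}
    for registry_name, entries in registries.items():
        for entry in entries:
            key = (entry["type"], entry["id"])
            record = merged.setdefault(key, {**entry, "listed_in": []})
            record["listed_in"].append(registry_name)
    return merged


def by_type(catalog, entry_type):
    """Return the names of all federated entries of one information type."""
    return sorted(r["name"] for (t, _), r in catalog.items() if t == entry_type)


# Hypothetical registries and entries.
REGISTRIES = {
    "data-registry-a": [
        {"type": "dataset", "id": "ds-1", "name": "Example bibliographic set"},
    ],
    "data-registry-b": [
        {"type": "dataset", "id": "ds-1", "name": "Example bibliographic set"},
        {"type": "software", "id": "sw-1", "name": "Example harvester"},
    ],
    "org-registry": [
        {"type": "organisation", "id": "org-1", "name": "Example institute"},
    ],
}

catalog = federate(REGISTRIES)
datasets = by_type(catalog, "dataset")
```

Keying on information type is what lets the platform scale up per federated type (data, organisations, software) without each registry changing its own catalog.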
evolving technology further
HARVESTER
OAI-PMH Service Provider #1
Schema #1
OAI-PMH Service Provider #n
Schema #n
INDEXER
Aggregated XML Repository
Web Portals
Open AGRIS (FAO)
AgLR/GLN (ARIADNE)
Organic.Edunet (UAH)
VOA3R (UAH)
...
AGRIS AP Schema
IEEE LOM Schema
DC Schema
...
RDF Triple Store
Common Schema
SPARQL endpoint (Data Source #1)
SPARQL endpoint (Data Source #n)
INDEXER
Web Portals
SPARQL endpoint
NOW (2012): CASE OF AGRICULTURAL INFRASTRUCTURES
2015 (AgINFRA): CASE OF AGRICULTURAL INFRASTRUCTURES
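The two architectures differ in how a portal gets at the metadata: harvesting XML over OAI-PMH into an aggregated repository, versus querying SPARQL endpoints. The sketch below illustrates both styles with standard-library Python only. The base URL, the sample response and the record contents are made up; the `ListRecords` verb, the `oai_dc` metadata prefix and the XML namespaces are those defined by OAI-PMH and Dublin Core. The SPARQL query is shown as a string only and is not sent anywhere.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a given provider."""
    return base_url + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})


def parse_records(xml_text):
    """Extract (identifier, title) pairs from an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for record in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results


# Hypothetical provider response, trimmed to the elements used above.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Maize yield trials</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

# The same question phrased for a SPARQL endpoint (query text only).
SPARQL_QUERY = """
SELECT ?record ?title
WHERE { ?record <http://purl.org/dc/terms/title> ?title }
LIMIT 10
"""

url = list_records_url("http://example.org/oai")
records = parse_records(SAMPLE_RESPONSE)
```

In the harvesting style every provider's schema has to be fetched and transformed before indexing; in the SPARQL style the query travels to each endpoint, which is where the scale questions on the following slides come from.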
How Many?
Big Data Problem!
Is it feasible?
http://semagrow.eu
wrapping up
which are the real problems that we are trying to solve?
information & technology are just enablers