Semantic Application for Digital Repositories Fabrizio Gagliardi EMEA & LATAM Director Technical Computing MSR External Research Microsoft Corporation

Post on 20-Dec-2015


Semantic Application for Digital Repositories

Fabrizio Gagliardi
EMEA & LATAM Director

Technical Computing, MSR External Research
Microsoft Corporation

• Advancement of Science
• Global Collaboration
• Technology Excellence
• Interoperability

Microsoft Research’s Commitment to Science

Putting computing into science…
Applying Microsoft products and research technologies to advance the scientific research and engineering innovation process

Putting science into computing…
Ensuring that research community requirements are factored into future versions of Microsoft software

Scholarly Communications: Project Overview

• Current or Completed Projects
o Cornell – arXiv.org + Word 2007 (and repository interoperability via SWORD)
o MIT / Broad Institute – Authoring (Word 2007) + data for research reproducibility
o MSR – CMT++ interoperability with data + metadata transfer/exchange (conference management tool enhancements)
o LiveLabs – eJournal publishing online service (community publishing tool)
o UC San Diego / PLoS – Semantic mark-up of scholarly articles (+ submission)
o Chem4Word with Office & Cambridge University – Create add-in to Word 2007 to facilitate drawing of chemical compounds and equations
o Johns Hopkins University – Digital Archive for Astronomy/Astrophysics data (storage, preservation and access)
o Planets Project / EU (with MSR – Cambridge) – OpenXML and file format preservation + interoperability
o eChemistry Project (Cornell, Penn State, Indiana, Cambridge, Southampton) – ORE exemplar: access to compound chemical info objects (cross-repository access to open chemistry data)
o British Library – Researcher Information Centre (RIC) online workflow tool for scientists and researchers
o Creative Commons Add-in for Office 2007 – evolving the Word 2003 effort
o University of Southampton (UK) – Port ePrints Repository Software for installation on the Windows platform
o University of Manchester / “MyExperiment” Project – social networking for scientists
o ORE Acceleration Project (OAI – Object Reuse & Exchange) – Alpha spec development
o Indiana University – Toolbox for Social Networking (SRT)
o UK National Archives – Virtual PC / Emulation of legacy systems to facilitate preservation
o National Library of Medicine / NCBI – “PubMed Int’l” UK version of PubMed + NLM DTD

• Pipeline
o DRIVER 2 (EU) – Infrastructure integration across a network of European research repositories

Research Output Repository Platform

Goals
• A platform for building services and tools for research output repositories
• Papers, Videos, Presentations, Lectures, References, Data, Code, etc.
• Relationships between stored entities
• Enable a tools and services ecosystem for “research output” repositories on MS technologies

Execution
• Utilizing OAI-ORE, SWORD, and other community protocols
• In development, deployment within MSR in early Q4
• Beta release to the community in late Q4
• Built on SQL Server 2008 + Entity Framework
• Using WPF and Silverlight for UI

Research output repository platform

Goals
• Create a platform for building “research output” repositories
• Engage with the digital library and scholarly communications community
• Become the “research output” repository for MSR (RMCr project)
– Papers, Videos, Presentations, Lectures, References, Data, Code, etc.
• Support an ecosystem of services and tools
• Available to the community for free (we are still considering the open source route)
• Build an easy-to-install collection of basic services and tools

Non-goals
• A generic platform for asset management
• Support the lifecycle of publications
• Compete with existing repository solutions

Research Output Repository Platform

Services/tools

Microsoft.Famulus.Framework

Microsoft.Famulus.Core (Based on the Entity Framework Model + extensions)

SQL Server 2008, MS data storage technologies, Entity Framework runtime

Researchers manage their personal research entities(data, citations, documents, workflows, etc.)

Entities + Relationships can be synched to cloud storage so that they are:

- Always Available - Sharable - Mixable - Harvestable

An Ecosystem of Research Repositories

Support of harvesting & federation to/from Institutional Repositories

- arXiv.org - DSpace - ePrints - Fedora - etc.

• Limited Tech Preview release due June 2008
• Public Beta targeted for Aug/Sept 2008

For more details
– Contact:
• Alex Wade (Program Manager) / [email protected]
– Community Forum:
• http://community.research.microsoft.com/forums/90.aspx

Current Project Status

eScience and Semantic Computing meet the Cloud

The cyberinfrastructure for the next generation of researchers


• Expect scientific research environments will follow similar trends to the commercial sector
– Leverage computing and data storage in the cloud
– Scientists already experimenting with Amazon S3 and EC2 services, with mixed results
• For many of the same reasons
– Siloed research teams, no resource sharing across labs
– High storage costs
– Low resource utilization
– Excess capacity
– High costs of reliably keeping machines up-to-date
– Little support for developers, system operators

The Future: Software plus Services for Science?

• Collective intelligence
– If last.fm can recommend what song to broadcast to me based on what my friends are listening to, why cannot the cyberinfrastructure of the future recommend articles of potential interest based on what the experts in the field that I respect are reading?
– Already examples emerging but the process is manual (Connotea, BioMedCentral Faculty of 1000 ...)
• Automatic correlation of scientific data
• Smart composition of services and functionality
• Cloud computing to aggregate, process, analyze and visualize data
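The collective-intelligence idea (recommend articles based on what respected experts are reading) can be illustrated with a very small sketch. This is not taken from any Microsoft service; the function name `recommend_articles`, the expert and paper identifiers, and the simple counting heuristic are all assumptions made for illustration.

```python
from collections import Counter


def recommend_articles(respected_experts, reading_lists, already_read, top_n=3):
    """Rank unread articles by how many respected experts are reading them.

    All names and data shapes here are hypothetical, for illustration only.
    """
    counts = Counter()
    for expert in respected_experts:
        # Use a set so each expert contributes at most one vote per article.
        for article in set(reading_lists.get(expert, [])):
            if article not in already_read:
                counts[article] += 1
    # Highest vote count first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [article for article, _ in ranked[:top_n]]


# Hypothetical reading lists for three experts.
reading_lists = {
    "expert_a": ["paper-1", "paper-2", "paper-3"],
    "expert_b": ["paper-2", "paper-3"],
    "expert_c": ["paper-3", "paper-4"],
}

suggestions = recommend_articles(
    ["expert_a", "expert_b", "expert_c"],
    reading_lists,
    already_read={"paper-3"},
)
print(suggestions)
```

A real service would presumably weight experts differently and draw the reading lists from shared repositories; plain vote counting is only the simplest possible stand-in.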

A smart cyberinfrastructure

• Important/key considerations
– Formats or “well-known” representations of data/information
– Pervasive access protocols are key (e.g. HTTP)
– Data/information is uniquely identified (e.g. URIs)
– Links/associations between data/information
• Data/information is inter-connected through machine-interpretable information (e.g. paper X is about star Y)
• Social networks are a special case of ‘data networks’
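As a rough illustration of "paper X is about star Y", linked data can be modelled as (subject, predicate, object) triples whose members are URIs. The sketch below uses plain Python tuples rather than an RDF library; every URI under `example.org` and the `match` helper are hypothetical names introduced here, not part of the talk.

```python
# Hypothetical URIs identifying resources and relationship types.
PAPER_X = "http://example.org/paper/X"
PAPER_Z = "http://example.org/paper/Z"
STAR_Y = "http://example.org/star/Y"
IS_ABOUT = "http://example.org/vocab/isAbout"
CITES = "http://example.org/vocab/cites"

# Each fact is a (subject, predicate, object) triple.
triples = {
    (PAPER_X, IS_ABOUT, STAR_Y),
    (PAPER_Z, IS_ABOUT, STAR_Y),
    (PAPER_X, CITES, PAPER_Z),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# "Which papers are about star Y?"
papers_about_star_y = [s for s, _, _ in match(triples, predicate=IS_ABOUT, obj=STAR_Y)]
print(papers_about_star_y)
```

Because both the data and the relationships are named by URIs, a machine can follow the links without knowing anything about astronomy, which is the point the slide makes.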

A world where all data is linked…

Attribution: Richard Cyganiak

…and stored/processed/analyzed in the cloud

scholarly communications

domain-specific services

The Microsoft Technical Computing mission to reduce time to scientific insights is exemplified by the June 13, 2007 release of a set of four free software tools designed to advance AIDS vaccine research. The code for the tools is available now via CodePlex, an online portal created by Microsoft in 2006 to foster collaborative software development projects and host shared source code. Microsoft researchers hope that the tools will help the worldwide scientific community take new strides toward an AIDS vaccine. See more.

instant messaging

identity

document store

blogs & social networking

mail

notification

search

books

citations

visualization and analysis services

storage/data services

compute services

virtualization

Project management

Reference management

knowledge management

knowledge discovery

Vision of Future Research Environment with both Software + Services

Thank you for your attention

• Thousand years ago – Experimental Science
– Description of natural phenomena
• Last few hundred years – Theoretical Science
– Newton’s Laws, Maxwell’s Equations…
• Last few decades – Computational Science
– Simulation of complex phenomena
• Today – eScience or Data-centric Science
– Unify theory, experiment, and simulation
– Using data exploration and data mining
• Data captured by instruments
• Data generated by simulations
• Data generated by sensor networks
– Scientists overwhelmed with data
– Computer Science and IT companies have technologies that will help

(With thanks to Jim Gray)

Emergence of a New Research Paradigm?

$$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$$

Web users...
• Generate content on the Web
– Blogs, wikis, podcasts, videocasts, etc.
• Form communities
– Social networks, virtual worlds
• Interact, collaborate, share
– Instant messaging, web forums, content sites
• Consume information and services
– Search, annotate, syndicate

Scientists...
• Annotate, share, discover data
– Custom, standalone tools
• Conferences, Journals
– Publication process is long, subscriptions, discoverability issues
• Collaborate on projects, exchange ideas
– Email, F2F meetings, video-conferences
• Use workflow tools to compose services
– Domain-specific services/tools

Today

Data can be easily produced

http://ecrystals.chem.soton.ac.uk

Thanks to Jeremy Frey

Data and services can be easily composed

SensorMap
Functionality: Map navigation
Data: sensor-generated temperature, video camera feed, traffic feeds, etc.

Taverna Workflow (diagram nodes A, B, C, D, E)

Compose services from the Web

Data is easily accessible

With thanks to Catharine van Ingen

Data is easily shareable

Sloan Digital Sky Server/SkyServer
http://cas.sdss.org/dr5/en/

Today…

Computers are great tools for
storing, computing, managing, indexing
huge amounts of data

For example, Google and Microsoft both have copies of the Web for indexing purposes

Tomorrow…

We would like computers to also help with the automatic
acquisition, discovery, aggregation, organization, correlation, analysis, interpretation, inference
of the world’s information

Computers will still be great tools for
storing, computing, managing, indexing
huge amounts of data

Semantic Computing

• Set of concepts and technologies
– Data modeling
– Relationships
– Ontologies
– Machine learning (entity extraction)
– Inference, reasoning
– Data, information, knowledge…
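To make "ontologies" and "inference, reasoning" a little more concrete, here is a minimal sketch of one common inference rule: if an entity is an instance of a class, and that class is a subclass of another, the entity is also an instance of the parent class. The class names, the `star_Y` entity, and the `infer_types` function are assumptions for illustration; real systems would typically express this with RDF/OWL tooling rather than hand-written loops.

```python
def infer_types(subclass_of, instance_of):
    """Forward-chain over a tiny ontology.

    Rule applied until nothing new is derived:
    (x instance-of C) and (C subclass-of D)  =>  (x instance-of D)
    """
    inferred = set(instance_of)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot because the set grows inside the loop.
        for entity, cls in list(inferred):
            for child, parent in subclass_of:
                if child == cls and (entity, parent) not in inferred:
                    inferred.add((entity, parent))
                    changed = True
    return inferred


# Hypothetical ontology: RedGiant is a kind of Star, Star a kind of AstronomicalObject.
subclass_of = {("RedGiant", "Star"), ("Star", "AstronomicalObject")}
# Hypothetical asserted fact: star_Y is a RedGiant.
instance_of = {("star_Y", "RedGiant")}

facts = infer_types(subclass_of, instance_of)
print(sorted(facts))
```

Only one fact was stated explicitly; the other two were derived by the machine, which is the step from "data" toward "knowledge" that the list describes.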

What is Semantic Computing?

Data
Information
Knowledge
Intelligence
Wisdom

Current technologies

Possibilities for innovation

• Term used to refer to the concept of “meaning”
• The linguistics, AI, Natural Language Processing, etc. communities have been working on “meaning” and “knowledge” related technologies for decades
• Pragmatic approach to Semantic Computing
– Emergence of a new breed of technologies to capture meaning (RDF, OWL, etc.)
– Combine with the pervasiveness of the Web community technologies such as folksonomies …

Semantics

• The term is used to describe a set of technologies used to represent data, concepts, and their relationships
– Has become a buzzword like Web 2.0
• Prefer to use the term “Semantic Computing”, which is about modeling data in ways that can be automatically processed by computers

A word about the “Semantic Web”

• Some efforts are driven by the traditional “knowledge engineering” community
– Engaged in building well-controlled ontologies
– Important for domain-specific vocabularies with data formats and relationships specific to a community
– Model does not easily scale to the Internet
• Some efforts are driven by the Web 2.0 community
– Focus on the pervasiveness of Web protocols/standards
– Emphasis on microformats (small, flexible, embeddable structures)
– Exploit evolving and ever-expanding vocabularies such as folksonomies and tag clouds
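The contrast with controlled ontologies can be seen in how little structure a folksonomy needs: users attach free-form tags, and a vocabulary emerges from usage counts. The sketch below is an assumed, simplified way to turn taggings into tag-cloud weights; the `tag_cloud` function, the scaling rule, and the sample tags are all hypothetical.

```python
from collections import Counter


def tag_cloud(taggings, max_size=5):
    """Map each tag to a display weight from 1 to max_size based on how often it is used.

    `taggings` is a list of (item, tag) pairs; tags are normalized by trimming
    whitespace and lower-casing, since no controlled vocabulary is enforced.
    """
    counts = Counter(tag.strip().lower() for _, tag in taggings)
    if not counts:
        return {}
    peak = max(counts.values())
    # Scale linearly against the most-used tag, never dropping below weight 1.
    return {tag: max(1, round(max_size * n / peak)) for tag, n in counts.items()}


# Hypothetical free-form tags applied by different users.
taggings = [
    ("paper-1", "Astronomy"),
    ("paper-2", "astronomy"),
    ("paper-2", "sdss"),
    ("paper-3", "astronomy "),
    ("paper-3", "workflow"),
    ("paper-1", "SDSS"),
]

cloud = tag_cloud(taggings)
print(cloud)
```

Nothing here defines what "astronomy" means; the vocabulary simply grows as people tag, which is why this approach scales more easily than a curated ontology but carries less formal meaning.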

Semantic Computing