broad data

Tetherless World Constellation Broad Data Jim Hendler Tetherless World Constellation Tetherless World Professor of Computer and Cognitive Science Director, Information Technology and Web Science Program Rensselaer Polytechnic Institute http://www.cs.rpi.edu/~hendler @jahendler (twitter)

Upload: james-hendler

Post on 07-May-2015

DESCRIPTION

In this talk I compare "Broad" data, the idea of thousands of datas

TRANSCRIPT

Page 1: Broad Data

Tetherless World Constellation

Broad Data

Jim Hendler
Tetherless World Constellation

Tetherless World Professor of Computer and Cognitive Science
Director, Information Technology and Web Science Program

Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~hendler

@jahendler (twitter)

Page 2: Broad Data

Outline (if I stick to it)

• Big Data ≠ Broad Data
• Broad Data problem
• Broad Data Example
  – Open Government Data
• Broad Data challenges
• How can you make money off this stuff?

Page 3: Broad Data

BIG Data

• The term “Big Data” is widely used nowadays
  – 3 main contexts
    • The large data collections of “big science” projects
    • The data holdings of a Google, Facebook or other large Web company
    • The enterprise data of large, non-Web-based companies (IBM, TATA, etc.)

Page 4: Broad Data

Big Data Challenge: Scaling

• Most of the focus of (current) Big Data research is on scaling (traditional) database-related technologies
  – Schema Modeling
  – Data Warehousing
  – Data mining
  – Statistical analysis
  – Mathematical Analytics
  – …

Page 5: Broad Data

How BIG is Big?

• Science uses some extremely large databases, and many of them are crucial to society
  – Petabytes of Data
• World Wide Web data is also extremely large
  – With primary resources to explore it held by companies
    • e.g., Facebook
      – 25 Terabytes of logged data per day; valuation $100B?
    • e.g., Google
      – In 2008 it was estimated at 20 petabytes per day (not including YouTube); 2010 valuation >$190B

Page 6: Broad Data

Big Data

Facebook generates terabytes of data per day

What could be learned from this?

Page 7: Broad Data

BIG Data

Google uses their data in many ways

Search => ads => user

Page 8: Broad Data

Big Data is becoming different on the Web

• New Work
  – Is moving away from traditional relational models
    • cf. NoSQL
  – Moving towards third party application and extension
    • cf. Mobile apps for local governments
  – Includes a focus on interoperability and exchange with “lightweight” semantics
    • Using ideas from the Semantic Web
      – Search: Schema.org
      – Social Networking: OGP

Page 9: Broad Data

BROAD data

• 4th context: Broad Data
  – The huge amount of freely available, but widely varied, Open Data on the World Wide Web (Structured and Semi-structured)
    • Example: The extended Facebook OGP graph (the part outside Facebook’s datasets)
    • Example: The growing linked open data cloud of freely available RDF linked data
    • Example: More than 710,000 datasets that are available on the Web free from governments around the world

Page 10: Broad Data

Example: adding “Breadth”

April 2010

Page 11: Broad Data

Facebook’s Open Graph Protocol

• Facebook now allows other sites to extend the graph
• Open Graph Protocol uses RDFa to let web sites contain information about the things people “like”
  og:title - The title of your object as it should appear within the graph, e.g., "The Rock".
  og:type - The type of your object, e.g., "movie". Depending on the type you specify, other properties may also be required.
  og:image - An image URL which should represent your object within the graph.
  og:url - The canonical URL of your object that will be used as its permanent ID in the graph.
  og:description - A one to two sentence description of your object.
  og:site_name - If your object is part of a larger web site, the name which should be displayed for the overall site, e.g., "IMDb".
  – Not a traditional “ontology”
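
The og: properties listed above are published as meta tags in a page's head, so a third party can read them without any Facebook API. A minimal sketch of that idea follows, using only Python's standard html.parser module. The sample markup, the class name OpenGraphParser and the helper extract_open_graph are illustrative assumptions, not part of the talk.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("property") or ""
        if name.startswith("og:") and "content" in attr_map:
            # Keep the first value seen for each property.
            self.properties.setdefault(name, attr_map["content"])


def extract_open_graph(html_text):
    """Return a dict of og: property names to their content values."""
    parser = OpenGraphParser()
    parser.feed(html_text)
    return parser.properties


# Hypothetical page markup, reusing the "The Rock" example from the slide.
SAMPLE_PAGE = """
<html><head>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:site_name" content="IMDb" />
<meta name="description" content="ignored: not an og: property" />
</head><body></body></html>
"""

if __name__ == "__main__":
    for key, value in extract_open_graph(SAMPLE_PAGE).items():
        print(key, "=", value)
```

The parser keeps only attributes whose property name starts with "og:", which is all the "lightweight" semantics a consumer needs to place the page in a graph.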

Page 12: Broad Data

OGP use growing quickly

15,178 sites of top 1,000,000 as of 3/3/11

In Sept 2012 Facebook announced extension of OGP for new uses

Page 13: Broad Data

Goal: OGP-powered social (e-commerce) apps

Page 14: Broad Data

Broad data (in Science)

• The “Deep Web” in Science (cf. Fox 2011)

  – Data behind web services
  – Data behind query interfaces (databases or files)

• Introduces a different curation problem

Page 15: Broad Data

Broad Data Science

(Fox & Hendler, Science, 2/11/10)

Page 16: Broad Data

BROAD data challenges

• For broad data the new challenges that emerge include
  – (Web-scale) data search
  – “Crowd-sourced” modeling
  – Rapid (and potentially ad hoc) integration of datasets
  – Visualization and analysis of only-partially modeled datasets
  – Policies for data use, reuse and combination

Page 17: Broad Data

Example: Government Data on the Web

Page 18: Broad Data

Government Data Sharing: “Year 1”

“Openness will strengthen our democracy and promote efficiency and effectiveness in Government.”

--- President Obama

[Timeline figure, 2009 to 2010]

Dates marked: January 1, 2009; May 21, 2009; June 30, 2009; December 8, 2009; January 19, 2010; May 21, 2010

Events marked: “Putting Govt Data online”; Data.gov.uk beta; data.gov online; “Open Government Directive” released; data.gov.uk online; data.gov relaunch with semantic web featured

Dataset counts shown: 57 Data Sets; ~6000 Data Sets; ~2000 Data Sets; >305,000 Data Sets

Page 19: Broad Data

Government Data Sharing: Year 2

Page 20: Broad Data

Government Data Sharing: Year 3

2012 so far:

http://www.gouv.fr

Released 300,000 French databases

US/India to release Open Government Platform

Kenya announces “Open Africa” project

Page 21: Broad Data

Government Data in the linked open data cloud

http://linkeddata.org/

Government Data is currently over ½ the cloud in size (~17B triples), 10s of thousands of links to other data (within and without)

Page 22: Broad Data

Important to the citizens: eg. Education

Data.gov.uk

RPI NYS demos

Page 23: Broad Data

Government “Data” Mashups

Page 24: Broad Data

Data.gov + epa.gov

Page 25: Broad Data

Page 26: Broad Data

Linking GDP of the US and China

GDP of China (Billion Chinese Yuan)

GDP of the US (Billion Dollars)

[Temporal Mashup] bea.gov + federalreserve.gov + stats.gov.cn

Page 27: Broad Data

Linking GDP of the US and China

GDP of China (Billion Chinese Yuan)

GDP of the US (Billion Dollars)

[Temporal Mashup] bea.gov + federalreserve.gov + stats.gov.cn

This mashup was built in less than 4 hours – including conversion of data, web interface, and visualization!
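
A temporal mashup of this kind comes down to aligning independently published series on a shared time key. The sketch below shows one way to do that in plain Python. It assumes the third source supplies a yuan-per-dollar exchange rate, which the slide does not state. The function name temporal_mashup and all numbers are placeholders chosen for easy arithmetic, not real statistics.

```python
def temporal_mashup(us_gdp_usd_bn, cn_gdp_cny_bn, cny_per_usd):
    """Join three year-indexed series and express China's GDP in dollars.

    Each argument is a dict mapping a year to a value. Only years present
    in all three series are kept, since each source may cover a different span.
    """
    shared_years = sorted(
        set(us_gdp_usd_bn) & set(cn_gdp_cny_bn) & set(cny_per_usd)
    )
    rows = []
    for year in shared_years:
        # Assumed conversion step: yuan divided by yuan-per-dollar.
        cn_gdp_usd_bn = cn_gdp_cny_bn[year] / cny_per_usd[year]
        rows.append({
            "year": year,
            "us_gdp_usd_bn": us_gdp_usd_bn[year],
            "cn_gdp_usd_bn": cn_gdp_usd_bn,
        })
    return rows


# Placeholder values only (NOT real GDP or exchange-rate figures).
PLACEHOLDER_US = {2000: 100.0, 2001: 110.0}
PLACEHOLDER_CN = {2000: 80.0, 2001: 160.0, 2002: 200.0}
PLACEHOLDER_RATE = {2000: 8.0, 2001: 8.0}

if __name__ == "__main__":
    for row in temporal_mashup(PLACEHOLDER_US, PLACEHOLDER_CN, PLACEHOLDER_RATE):
        print(row)
```

Intersecting the year keys first is the design choice that makes the integration "ad hoc": no shared schema is required beyond agreeing on what a year is.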

Page 28: Broad Data

Linking to “context” important

Datasets: acres burned, and agency budgets

DBpedia: Wikipedia descriptions of major US fires

Page 29: Broad Data

Integrate with Social media

Page 30: Broad Data

Combining data from different data sharing sites

Page 31: Broad Data

http://logd.tw.rpi.edu demos, tutorials, RDF-ized datasets, and more

Page 32: Broad Data

Broad Data “Integration” requires simple semantics

Page 33: Broad Data

Example: any Wikipedia topic!

Page 34: Broad Data

Metadata is crucial for Broad Data

• Metadata design is crucial to govt data sharing
  – Needed for search and federation in large data sharing efforts
• International data sharing
  – W3C Govt Linked Data Working Group
  – Need for vocabularies within govt sectors
    • Esp. for cross-language use
  – How can we compare health (or legal, or social, or ….) data between countries like US, UK, India, Kenya (English) with Norway, China, France, etc.?
  – How can we link local govts (in traditional languages, local dialects, etc.) w/national data?

Page 35: Broad Data

International Open Government Data Search

Page 36: Broad Data

Searching for data

• Faceted browser with
  – Keyword search
  – Catalogs
  – Countries
  – Agencies
  – Categories
  – (in any order)
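
The facets above can be thought of as independent filters over catalog metadata, which is why they can be applied in any order. Here is a minimal sketch under that assumption. The records, field names and the functions facet_search and facet_counts are hypothetical and do not describe the actual catalog search demo.

```python
from collections import Counter

# Hypothetical catalog metadata records (invented for illustration).
DATASETS = [
    {"title": "School enrollment by district", "catalog": "catalog-a",
     "country": "US", "agency": "Education", "category": "education"},
    {"title": "School inspection results", "catalog": "catalog-b",
     "country": "UK", "agency": "Education", "category": "education"},
    {"title": "Air quality monitors", "catalog": "catalog-a",
     "country": "US", "agency": "Environment", "category": "environment"},
    {"title": "Hospital waiting times", "catalog": "catalog-b",
     "country": "UK", "agency": "Health", "category": "health"},
]


def facet_search(records, keyword=None, **facets):
    """Return records whose title contains keyword and that match every facet."""
    results = []
    for record in records:
        if keyword and keyword.lower() not in record["title"].lower():
            continue
        if all(record.get(name) == value for name, value in facets.items()):
            results.append(record)
    return results


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(record[facet] for record in records)


if __name__ == "__main__":
    us_only = facet_search(DATASETS, country="US")
    print(facet_counts(us_only, "category"))
    print(facet_search(us_only, keyword="school"))
```

Because each filter only narrows the current result list, choosing country first and category second returns the same records as the reverse order.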

Page 37: Broad Data

Details and download…

http://logd.tw.rpi.edu/demo/international_dataset_catalog_search

Page 38: Broad Data

Research in Govt Data => Broad Data challenges

• Trust
  – Government data is controversial, and potentially biased
    • How do we confirm or dispute?
• Combination
  – When we combine data we need to keep the provenance of information (see trust)
    • How do we make policies explicit and sharable?
• Scaling
  – Our project has already converted 9.9B triples from only >2,000 of the 710,000 government databases we can identify (116 catalogs, 32 countries, 16 languages)
    • Cross-catalog
    • Cross-language
• Versioning and updating
• Archiving
• Visualization
• …
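
The Trust and Combination points above suggest keeping a source label attached to every statement through a merge, so that disagreements can be traced back later. The following is a minimal sketch of that idea and not the project's actual pipeline. The Fact type, the functions combine and disputed, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    value: str
    source: str  # provenance: which catalog published this statement


def combine(*datasets):
    """Merge datasets, recording every source behind each statement."""
    merged = {}
    for dataset in datasets:
        for fact in dataset:
            key = (fact.subject, fact.predicate, fact.value)
            merged.setdefault(key, set()).add(fact.source)
    return merged


def disputed(merged):
    """Return (subject, predicate) pairs for which sources report different values."""
    values = {}
    for subject, predicate, value in merged:
        values.setdefault((subject, predicate), set()).add(value)
    return {key for key, seen in values.items() if len(seen) > 1}


# Invented sample statements from two hypothetical catalogs.
CATALOG_A = [
    Fact("district-1", "enrollment", "1200", "catalog-a"),
    Fact("district-2", "enrollment", "800", "catalog-a"),
]
CATALOG_B = [
    Fact("district-1", "enrollment", "1250", "catalog-b"),
    Fact("district-2", "enrollment", "800", "catalog-b"),
]

if __name__ == "__main__":
    merged = combine(CATALOG_A, CATALOG_B)
    for key, sources in merged.items():
        print(key, "from", sorted(sources))
    print("disputed:", disputed(merged))
```

Keeping the sources as a set per statement means agreement between catalogs is visible as corroboration, while conflicting values stay separate and can be flagged for a downstream user.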

Page 39: Broad Data

Exploring new visualizations

Data from http://littlesis.org

Page 40: Broad Data

Reaching beyond the government

Page 41: Broad Data

Broad Data Goes Beyond the Govt

http://linkeddata.org/

Page 42: Broad Data

Broad Data Challenges

• Finding and Using Broad Data is an emerging challenge
  – How do I find a dataset in the many out there that might be of use to me?
    • Cannot keyword search in data
  – How do I know what is in a large data store? In the cloud?
    • What is the coverage?
    • What is the access?
    • Who do I need to ask for what?
  – What are the rules about using it?
    • What can I combine it with?
    • How do downstream users know I’ve combined it?

Page 43: Broad Data

Broad Data Market?

• Significant and growing commercial interest…
  – Web: Google, Amazon, Travelocity…
  – Web 2.0: Facebook, Wikipedia, YouTube, Twitter…
  – Web 3.0: ??

Page 44: Broad Data

Broad Data Market?

• Significant and growing commercial interest…
  – Web: Google, Amazon, Travelocity…
  – Web 2.0: Facebook, Wikipedia, YouTube, Twitter…
  – Web 3.0: ??

Broad Data Goes Here

Page 45: Broad Data

Research (and business) Opportunities

• Broad Data is a great field for those looking for emerging opportunities
  – Tooling is needed
  – (Business) Models are just starting to emerge
  – Scalability Infrastructure is there
  – Massive Distribution (think mobile) is wide open to Web 3.0 innovation

• Govt data gives us a place to cooperate (with public good) while exploring all of the above

Page 46: Broad Data

Conclusions

• Big data is going Broad
  – World Wide Web trend towards more and more varied data
    • In many domains
      – E-commerce, Open Govt, many more (cf. Health/Medical care)
• Broad data requires thinking outside the “Database” box
• Broad data opens exciting possibilities for research and innovation
  – Come play!