Structured Data in Web Search

Post on 10-May-2015

Category: Science

DESCRIPTION

For the first time since the emergence of the Web, structured data is playing a key role in search engines and is therefore being collected via a concerted effort. Much of this data is being extracted from the Web, which contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the Web provides a platform that encourages publishing more data sets from governments and other public organizations. The Web also supports new data management opportunities, such as effective crisis response, data journalism and crowd-sourcing data sets. I will describe some of the efforts we are conducting at Google to collect structured data, filter the high-quality content, and serve it to our users. These efforts include providing Google Fusion Tables, a service for easily ingesting, visualizing and integrating data; mining the Web for high-quality HTML tables; and contributing these data assets to Google's other services.

Alon Halevy heads the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the database group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004 he founded Transformic, a company that created search engines for the deep web and was acquired by Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). He received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelor's degree from the Hebrew University in Jerusalem. Halevy is also a coffee culturalist: he published the book "The Infinite Emotions of Coffee" in 2011 and is a co-author of the book "Principles of Data Integration", published in 2012.

TRANSCRIPT

Structured Data on the Web

Alon Halevy, Google

May 23, 2014

Joint work with: Jayant Madhavan, Cong Yu, Fei Wu, Hongrae Lee, Warren Shen, Anish Das Sarma, Rahul Gupta, Boulos Harb, Zack Ives, Afshin Rostamizadeh, Sree Balakrishnan, Anno Langen, Steven Whang, Mohamed Yahya, and others

Structured Data in Search Results

Set Queries: Chicago restaurants

Association Queries

Data in Movies!

The Knowledge Graph

[Diagram: Knowledge Graph fragment with the labels Brazil, Brasilia, capital, population, 2014, 2001, mayor]

Query Reformulation

• Brazil capital

• What is the capital of Brazil

• “Google, tell me the capital of brazil”

• Brazil nuts

• Culture of Brazil

• “Google, will Brazil win the world cup?”
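
One way to picture the example queries on this slide: a query such as "Brazil capital" can be answered from the graph when its words reduce to an entity plus an attribute, while "Brazil nuts" cannot. The sketch below is an illustrative assumption, not how Google reformulates queries. The `KG` dictionary, the `STOPWORDS` set, and the `answer` function are all hypothetical; only the two facts stored in `KG` come from the slides.

```python
# Toy knowledge graph keyed by (entity, attribute).
# The two facts appear in the slides; the data structure is an assumption.
KG = {
    ("brazil", "capital"): "Brasilia",
    ("brasilia", "population"): "2207718",
}

# Hypothetical filler words that carry no entity or attribute.
STOPWORDS = {"what", "is", "the", "of", "google", "tell", "me"}

def answer(query):
    """Return a KG value if the query reduces to one entity plus one attribute."""
    words = [w.strip('?,."“”').lower() for w in query.split()]
    words = [w for w in words if w and w not in STOPWORDS]
    if len(words) != 2:
        return None
    a, b = words
    # Try both orders: "Brazil capital" and "capital of Brazil".
    return KG.get((a, b)) or KG.get((b, a))
```

Under these assumptions, the first three queries all reduce to the pair (brazil, capital), and the last three find no match in the graph.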

Other Sources of Data

Brazil capital

The population of Brasilia is 2207718 according to the GeoNames geographical database

Tables Text

Answer Queries Directly from Web?

The Web vs. the Knowledge Graph

Tables, Tables

Fusion Tables: Enabling a broad range of users to create tabular content

WebTables: Finding good HTML tables on the Web

• City planning

• Sustainability: water, coffee, …

• Crisis response

• Advancing public discourse (e.g., gun control)

• Data philanthropy: corporations are encouraged to contribute data for the good of society.

Background for Coffee Examples

Fusion Tables

google.com/fusiontables

[SIGMOD 2010, SIGMOD 2012]

• Goal: an easy-to-use database system that is integrated with the Web.

• Key: support common workflows

– Easy upload (CSV, KML, spreadsheets)

– Sharing (even outside your company)

– Visualizations front and center

– Easy publishing

• Goal 2: Fusion in the data cloud -- discover others’ data and combine with yours.
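
At its simplest, "combine with yours" is a join between your table and someone else's on a shared key column. The following is a minimal sketch in plain Python, not the Fusion Tables API. The `fuse` helper, the column names, and every value in the two sample tables are made-up placeholders for illustration.

```python
def fuse(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    index = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            # Columns from the right table are added to the left row.
            merged.append({**row, **match})
    return merged

# Placeholder values for illustration only, not real statistics.
production = [
    {"country": "A", "coffee_bags": 100},
    {"country": "B", "coffee_bags": 40},
]
population = [
    {"country": "A", "population": 50},
    {"country": "C", "population": 10},
]

fused = fuse(production, population, "country")
```

Only country "A" appears in both tables, so `fused` holds a single row carrying both the production and the population columns. Real data sets rarely share clean keys, which is why the talk treats integration as a search problem.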

Coffee Producing Countries

Coffee Consumption Per Capita

Big Data for Regular People

Table Facts:

English poverty rates: 32,000 wards with a total of 1.8 million vertices. Colors indicate poverty levels.

2011 Rioting: 2,100 incidents. Colors indicate addresses of rioting and rioters.

Best UK Internet Journalist

Knight-Batten Award for Innovations in Journalism

Crowd Sourcing

Data Integration as Search

Join with Population Data: What is a City?

Big Data Integration

Table Facts:

Texas Counties 2010 Census: 254 counties with 543,000 vertices. Colored based on various demographics.

See SIGMOD 2012 paper for details on scaling map visualizations

Crowdsourcing Cafes

HTML Tables
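
Finding good HTML tables starts with pulling every `<table>` out of a page and discarding the ones that are clearly not data, for example tables used only for page layout. The sketch below uses Python's standard `html.parser` with a deliberately crude filter. The `TableExtractor` class, the `looks_relational` thresholds, and `good_tables` are assumptions made for illustration and stand in for, rather than reproduce, the WebTables quality classifier. Nested tables are not handled.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def looks_relational(table, min_rows=3, min_cols=2):
    """Crude filter: enough rows, and every row has the same column count."""
    if len(table) < min_rows:
        return False
    widths = {len(row) for row in table}
    return len(widths) == 1 and widths.pop() >= min_cols

def good_tables(html):
    """Return the tables in `html` that pass the crude relational filter."""
    parser = TableExtractor()
    parser.feed(html)
    return [t for t in parser.tables if looks_relational(t)]
```

A one-cell navigation table is dropped by this filter, while a header row followed by two or more data rows of equal width is kept. A production system would need far richer signals than row and column counts.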

Search Engine for Data Sets

research.google.com/tables

[VLDB 2008, 2011, 2014]

Give Answers from Tables

It Better Be Right!

Answer with a Visualization

Long Term Goal: A Data-Guided Decision Engine

• Support decision making:

– Healthcare debate

– Should I install solar in my house?

– Which charity should I contribute to?

• Show relevant data

– Expose facets of the decision and enable drilldown

– Show opposing views

• Manually curated examples of decision engines:

– Justfacts.com, followthemoney.com, decide.com

WebTables on google.com!

HTML Lists

See Elmeleegy et al., VLDB 2009

Tree Search

Amish quilts

Parking tickets in India

Horses

The Deep Web [Madhavan et al., VLDB 2008]

Other Sources of Data

• Spreadsheets

• CSV files

• Tables embedded in PDF

• XML, RDF

• Visualizations

• Online databases (Fusion Tables, Tableau, …)

Each source has its particularities, but most problems are common to all.

Non-Tabular Data in HTML

Vertical Tables

Data Optimized for Page Layout

Tabular Data Optimized for Site Layout

See [Ling et al., IJCAI 2013] for stitching tables within a site.

Semantics Can Be Brittle

Semantics are in Text

The Big Challenge

• Analyze natural language text as it pertains to structured data.

• Different from (open) information extraction that builds databases entirely from text.

• Good news: natural language parsing technology is now scalable.

First Step: Annotating Columns [Venetis et al., VLDB 2011]
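
Annotating a column means attaching a class label such as "country" or "city" to it based on the values it contains. The sketch below uses a simple majority vote over a tiny hand-written isA dictionary. It is a simplified stand-in and not the model from the cited paper; the `IS_A` dictionary, the `annotate_column` function, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical isA facts; a real system would mine these at Web scale.
IS_A = {
    "brazil": {"country"},
    "el salvador": {"country"},
    "brasilia": {"city"},
    "chicago": {"city"},
}

def annotate_column(cells, min_support=0.5):
    """Return the label backed by the largest share of cells, or None."""
    votes = Counter()
    for cell in cells:
        for label in IS_A.get(cell.strip().lower(), ()):
            votes[label] += 1
    if not votes:
        return None
    # Highest count wins; ties are broken alphabetically for determinism.
    label, count = min(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return label if count / len(cells) >= min_support else None
```

With these assumptions, a column holding "Brazil", "El Salvador" and "Chicago" is labeled "country" because two of its three cells support that label, and a column of unknown strings gets no label at all.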

Step 2: Understanding Relationships

Dictionary of Attributes

• I want the list of all attributes that countries may have.

• Freebase doesn’t have coffee production.

• Is this an ontology?

– Not quite! I want an ontology suited for search.

Biperpedia: [VLDB 2014]

Ontology for Search Applications

Comparing to Freebase Coverage

Tower of Babel: Internet Style

In 2013, the coffee production of El Salvador dropped by 20% due to the coffee rust disease.

Coffee production el salvador 2013

El Salvador exports coffee 2013

Knowledge Graph

Tables Text

Conclusions

• This was a talk about Big Data:

– Millions of people creating data sets

– Billions of people seeing the data being impacted

• Get out there and find your favorite application.

• Dreams do come true:

– At least as it pertains to structured data on the Web!

References

• Fusion Tables: SIGMOD 2010, 2012

• WebTables: VLDB 2008, 2009, 2011
