information extraction 3. design considerations, crawling and scraping · 2019-10-31 ·...

Information extraction 3. Design considerations, crawling and scraping Simon Razniewski Winter semester 2019/20 1


Page 1: Information extraction 3. Design considerations, crawling and scraping · 2019-10-31 · Information extraction 3. Design considerations, crawling and scraping Simon Razniewski Winter

Information extraction

3. Design considerations,

crawling and scraping

Simon Razniewski

Winter semester 2019/20


Announcements

• Assignments

• Do not plagiarize

• Submit outputs where asked

• No lecture or tutorial next week

• Automating extraction?

• Stay tuned…

• Visualizing KGs

• https://www.wikidata.org/wiki/Wikidata:Tools/Visualize_data

• https://angryloki.github.io/wikidata-graph-builder/?property=P40&item=Q3044&iterations=100&limit=100

• https://angryloki.github.io/wikidata-graph-builder/?property=P737&item=Q937&iterations=100&limit=100

• https://gate.d5.mpi-inf.mpg.de/webyago3spotlxComp/SvgBrowser/

• https://developers.google.com/knowledge-graph


• https://www.reddit.com/r/wikipedia/comments/dg6pnl/the_death_date_of_lucius_pinarius_wasnt_added_so/

• https://www.wikidata.org/wiki/Wikidata:Project_chat#unknown_values_for_people_who_have_long-since_died


Outline

1. Design considerations

2. Crawling

3. Scraping


IE design considerations

1. What should be the output?

• Type of information

• Quality requirements

2. What is the best suited input?

3. Which method to get from input to output?


[Figure: inputs, methods and outputs of information extraction]

Inputs: Premium sources (Wikipedia, IMDB, …) · Semi-structured data (infoboxes, tables, lists, …) · High-quality text (news articles, Wikipedia, …) · Difficult text (books, interviews, …) · Text documents & Web pages · Web collections (Web crawls) · Online forums & social media · Conversations & behavior · Queries & clicks

Methods: Rules & patterns · Logical inference · Statistical inference · Deep learning · NLP tools

Outputs: Entity names, aliases & classes · Entities in taxonomy · Relational statements · Canonicalized statements · Rules & constraints


[Figure: the same inputs/methods/outputs diagram, with "Crawling" marked]


[Figure: the same inputs/methods/outputs diagram, with "Scraping" marked]


Outline

1. Design considerations

2. Crawling

3. Scraping


Acknowledgment

• Material adapted from Fabian Suchanek and Antoine Amarilli


Freshness problem (2)

• Prediction problem: Estimate page change frequency

• From previous change behavior

• Or from page content

• Optimization problem: Decide crawl frequency

• Fixed budget → how to distribute it

• Flexible budget → cost-benefit framework needed


Estimating change frequencies

• Cho and Garcia-Molina, TOIT 2003

• Model changes as Poisson processes (i.e., memoryless/statistically independent)

• Extrapolate change frequency from previous visits

Daily visit for 10 days, 6 changes detected
→ Change frequency: 0.6 changes/day?

• Extrapolation underestimates change frequency, because several changes between two visits are detected as a single change

• Liang et al., IJCAI 2017

• Monitor news websites

• Build supervised prediction models based on page features

• Wijaya et al., EMNLP 2015

• Wikipedia-specific

• Learn state-change-indicating terms

• E.g., engage, divorce
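
The underestimation in the 10-day example can be made concrete. The sketch below is only an illustration under the Poisson assumption named above: if changes arrive at rate λ, the chance that a visit sees at least one change since the previous visit is 1 − e^(−λ·interval), which can be inverted. The function names are hypothetical, and this is not claimed to be the exact estimator of the cited paper.

```python
import math

def naive_rate(changes_detected, visits, interval=1.0):
    """Detected changes per time unit; each visit counts at most one change."""
    return changes_detected / (visits * interval)

def poisson_rate(changes_detected, visits, interval=1.0):
    """Invert P(change seen) = 1 - exp(-rate * interval) under a Poisson model."""
    if changes_detected >= visits:
        raise ValueError("every visit saw a change; the rate cannot be bounded")
    share_changed = changes_detected / visits
    return -math.log(1.0 - share_changed) / interval

# Slide example: daily visits for 10 days, 6 changes detected.
naive = naive_rate(6, 10)       # 0.6 changes/day
corrected = poisson_rate(6, 10) # larger, because multiple changes per day are possible
```

With these inputs the corrected estimate is roughly 0.92 changes per day, noticeably above the naive 0.6.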


Wijaya et al., EMNLP 2015


Distributing crawl resources

[Razniewski, CIKM 2016]

• Ingredients:

• Benefit of an up-to-date website

• Synonymous: cost of outdated website

• Cost of a crawl action

• Decay behavior

→ Page-specific recrawl frequency that maximizes benefit minus cost


Decay behaviour


Observed decay behaviour


Average freshness F


Net income NI

B … benefit per time unit

F … average freshness

Λ … decay coefficient

u … update interval length

C … cost of an update

Optimum via common algebra
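
The formula itself is not part of this transcript. As a hedged illustration of the trade-off only, the sketch below assumes exponential decay, so that average freshness over an update interval u is (1 − e^(−Λu)) / (Λu), and assumes net income NI(u) = B·F(u) − C/u. Both functional forms, the function names and the numbers are assumptions, not taken from the slide; the optimum is found by a simple grid search rather than algebraically.

```python
import math

def average_freshness(decay, u):
    """Mean of exp(-decay * t) for t in [0, u] (assumed exponential decay)."""
    x = decay * u
    return -math.expm1(-x) / x

def net_income(benefit, cost, decay, u):
    """Assumed form: benefit per time unit times freshness, minus crawl cost per time unit."""
    return benefit * average_freshness(decay, u) - cost / u

def best_interval(benefit, cost, decay, candidates):
    """Pick the candidate update interval with the highest net income."""
    return max(candidates, key=lambda u: net_income(benefit, cost, decay, u))

# Illustrative numbers: yearly benefit = 100 x cost of one crawl, decay 0.5 per year.
candidates = [k / 100 for k in range(1, 1001)]  # 0.01 .. 10 years
best = best_interval(100.0, 1.0, 0.5, candidates)
```

Crawling very often pays the cost C many times for little extra freshness, while crawling rarely lets freshness decay, so the maximum lies in between.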


Examples for address updates

NI over u

Assumption: benefit over one year = 100 × cost of a single crawl

The actual ratio is orders of magnitude lower, e.g., 0.003 cents per crawl

[http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/]

(and for 580 $ on Amazon EC2)


and later Google)


https://www.mpi-inf.mpg.de/robots.txt

https://www.google.de/robots.txt
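
A crawler would typically check such a robots.txt file before fetching a URL. The sketch below uses Python's standard `urllib.robotparser`; the rules, the user agent name "my-crawler" and the example.org URLs are made up for illustration, and the rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download it from the site.
RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# can_fetch(user_agent, url) tells whether the rules permit the request.
allowed = parser.can_fetch("my-crawler", "https://www.example.org/index.html")
blocked = parser.can_fetch("my-crawler", "https://www.example.org/private/data.html")
```

For a live site, `set_url(...)` followed by `read()` loads the file instead of `parse(...)`.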


Try often enough


Deep web / dark web


Insights from crawling mpi-inf.mpg.de

• URL ending inclusion/exclusion criteria need thought

• Long (machine-generated) URLs need exclusion

• Beyond that no issues

• 35 lines in Python

• Sequential runtime for 2000 pages: ~10 minutes

• Completeness?
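
The crawler behind these numbers is not shown in the transcript. The following is a minimal breadth-first sketch of what such a crawler could look like, not the lecturer's code. The excluded URL endings, the length cut-off and all names are assumptions; `fetch` is passed in as a function so that the example can run against an in-memory site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

SKIP_ENDINGS = (".pdf", ".jpg", ".png", ".zip")  # assumed exclusion list
MAX_URL_LENGTH = 200                              # assumed cut-off for machine-generated URLs

class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def wanted(url, host):
    """Same host, plausible length, and no excluded file ending."""
    parts = urlparse(url)
    return (parts.scheme in ("http", "https") and parts.netloc == host
            and len(url) <= MAX_URL_LENGTH
            and not parts.path.lower().endswith(SKIP_ENDINGS))

def crawl(seed, fetch, max_pages=2000):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    host = urlparse(seed).netloc
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL without #fragment
            if link not in seen and wanted(link, host):
                seen.add(link)
                queue.append(link)
    return pages

# Tiny in-memory "site" (hypothetical host) standing in for real HTTP requests.
SITE = {
    "http://h.test/": '<a href="/a">a</a><a href="/b.pdf">b</a><a href="http://other.test/">o</a>',
    "http://h.test/a": '<a href="/">home</a><a href="/c#top">c</a>',
    "http://h.test/c": "no links here",
}
pages = crawl("http://h.test/", SITE.get)
```

A real `fetch` would issue HTTP requests, honour robots.txt and pause between requests; the queue gives the BFS order, and the `seen` set prevents revisiting pages.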


Outline

1. Design considerations

2. Crawling

3. Scraping


[Figure: the same inputs/methods/outputs diagram, with "Crawling" marked]


[Figure: the same inputs/methods/outputs diagram, with "Scraping" marked]


Scraping aims to reconstruct the KB


[https://www.w3schools.com/xml/xml_xpath.asp]

[https://devhints.io/xpath]


https://www.freeformatter.com/xpath-tester.html

<html>

<body>

<b>Shrek</b>

<ul>

<li>Creator: <b>W. Steig</b></li>

<li>Duration: <i>84m</i></li>

</ul>

</body>

</html>


Scraping: Browser

• “Try XPath” Firefox add-on

• //h3[@class='pi-data-label pi-secondary-font']

• Firefox console

• $x('//h3[@class=\'pi-data-label pi-secondary-font\']')

• //h3[@class='pi-data-label pi-secondary-font'] | //div[@class='pi-data-value pi-font']


Scraping in Python - XPath
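
The code from this slide is not part of the transcript. As a stand-in, here is a small sketch that queries the Shrek page from the XPath tester slide using only the standard library. `xml.etree.ElementTree` supports just a subset of XPath and needs well-formed markup, so for real web pages a full XPath engine such as lxml would be the more usual choice; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
<body>
<b>Shrek</b>
<ul>
<li>Creator: <b>W. Steig</b></li>
<li>Duration: <i>84m</i></li>
</ul>
</body>
</html>
"""

root = ET.fromstring(PAGE)           # root is the <html> element

title = root.find("./body/b").text   # <b> directly below <body>
creator = root.find(".//li/b").text  # <b> inside any <li>
duration = root.find(".//li/i").text # <i> inside any <li>
labels = [li.text.strip() for li in root.findall(".//li")]
```

Here `labels` holds the text before the nested tag in each list item, i.e. the attribute names of the record.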


gender: string)))


Crescenzi et al., VLDB 2001

http://www.vldb.org/conf/2001/P109.pdf

Finds least upper bounds in regex lattice
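
To give a feel for template inference, the toy sketch below aligns two pages from the same (made-up) template and replaces the parts that differ with a field marker. It relies on Python's `difflib` and is a deliberately simplified illustration, not the algorithm of the cited paper: it handles neither optional nor repeated parts. All names and sample pages are assumptions.

```python
import difflib
import re

def tokenize(html):
    """Split a page into tags and text chunks."""
    return re.findall(r"<[^>]+>|[^<]+", html)

def induce_template(page_a, page_b):
    """Keep tokens shared by both pages; mark differing stretches as fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    template = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        else:
            template.append("#FIELD")
    return template

# Two made-up pages generated from the same template.
template = induce_template("<b>Shrek</b><i>84m</i>", "<b>Up</b><i>96m</i>")
```

The resulting token list keeps the shared tags and marks the two varying text positions, which is the kind of generalization over pages that wrapper induction aims for.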


Crescenzi et al., VLDB 2001

http://www.vldb.org/conf/2001/P109.pdf


Scraping in Python – BeautifulSoup (1)

• Python library for pulling data out of HTML and XML files.

<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and …

# html_doc holds the document above as a string
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

soup.title
# <title>The Dormouse's story</title>

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


Scraping in Python – BeautifulSoup (2)


XPath vs. BeautifulSoup vs …

• XPath: Generic query language to select nodes in XML (HTML) documents

• Queries can be issued from Python, Java, C, …

• BeautifulSoup

• Python library to manipulate websites as Python objects

• Scrapy

• Python library to crawl websites

• Selenium

• Actual scripted browser interaction

→ To get around JavaScript etc.

https://www.udemy.com/tutorial/scrapy-tutorial-web-scraping-with-python/scrapy-vs-beautiful-soup-vs-selenium/


Assignment 3

• No crawling (ethics…)

• 1x Extraction from dump – infobox treasure

• Remember design considerations

• XML format, but essential content not structured by XML tags

→ pattern matching/regex

• 2x Scraping

• BeautifulSoup recommended, but XPath fine as well

• Reading on large-scale WP extraction: DBpedia extraction framework
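
As a pointer for the regex-based part, the sketch below pulls `key = value` pairs out of a made-up infobox snippet. The sample text, the pattern and the function name are illustrative assumptions; real infoboxes contain nested templates, links and multi-line values that this simple pattern does not handle.

```python
import re

# Made-up wikitext sample in the usual "| key = value" infobox layout.
WIKITEXT = """{{Infobox person
| name = Ada Lovelace
| birth_date = 1815
| occupation = Mathematician
}}"""

# One field per line: "|", a key, "=", then the value up to the end of the line.
FIELD = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)

def infobox_fields(text):
    """Return the infobox fields of a wikitext snippet as a dict."""
    return dict(FIELD.findall(text))

fields = infobox_fields(WIKITEXT)
```

Matching line by line with `re.MULTILINE` keeps the pattern simple; the surrounding whitespace is trimmed by the pattern itself.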


Take home

1. Think about goal, sources, methods

2. Crawling

• BFS to achieve coverage

• Challenges with traps and deep web

3. Scraping

• Reverse-engineering of template-based websites
