Scraping in 60 minutes

Paul Bradshaw, Leanpub.com/scrapingforjournalists, Scraping in 60 mins, Saturday, 10 May 14

Upload: paul-bradshaw

Post on 06-May-2015

DESCRIPTION

Presentation at Data Harvest 2014

TRANSCRIPT

Page 1: Scraping in 60 minutes

Paul Bradshaw
Leanpub.com/scrapingforjournalists

Scraping in 60 mins

Saturday, 10 May 14

Page 2: Scraping in 60 minutes

https://www.youtube.com/watch?v=Efr-VEkwWoM

Page 12: Scraping in 60 minutes

Function (arguments)
(aka parameters)

Page 13: Scraping in 60 minutes

Function (arguments)
=SUM(A2:A50)
=AVERAGE(B2:B300)
=COUNTIF(A10:A3000,"Smith")

Page 14: Scraping in 60 minutes

Function (parameters)
=SUM(range of cells to be summed)
=AVERAGE(range of cells to be averaged)
=COUNTIF(range of cells to be counted, what to count)
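
The same function-and-arguments pattern carries over from spreadsheets to Python. A rough sketch follows; the sample values are invented, and `countif` is a hypothetical helper written here for illustration, not a built-in.

```python
from statistics import mean


def countif(values, what_to_count):
    """Count how many items in values equal what_to_count (mimics =COUNTIF)."""
    return sum(1 for value in values if value == what_to_count)


# Invented sample data standing in for two spreadsheet columns.
ages = [34, 41, 29, 41]
surnames = ["Smith", "Jones", "Smith", "Patel"]

total = sum(ages)                    # like =SUM(A2:A5)
average = mean(ages)                 # like =AVERAGE(A2:A5)
smiths = countif(surnames, "Smith")  # like =COUNTIF(B2:B5,"Smith")

print(total, average, smiths)
```

Here `values` and `what_to_count` are the parameters named in the definition; `surnames` and `"Smith"` are the arguments supplied when the function is called.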

Page 15: Scraping in 60 minutes

("string", index)
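
One way to picture a string and an index, assuming the slide means text in quotes and a position number. This sketch uses Python, which counts positions from 0; the variable names are made up for illustration.

```python
team = "Fortuna Sittard"   # a string: text wrapped in quotes

first_character = team[0]  # index 0 is the first character
words = team.split()       # split the string into a list of words
second_word = words[1]     # index 1 is the second item in the list

print(first_character, second_word)
```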

Page 16: Scraping in 60 minutes

Tip: search for documentation

Page 17: Scraping in 60 minutes

Variable

Page 18: Scraping in 60 minutes

Variables

Page 19: Scraping in 60 minutes

Jargon checklist:
Function
Arguments
Parameters
String
Index
Variable
Documentation

Page 20: Scraping in 60 minutes

Vote: =importXML or Python?

Page 21: Scraping in 60 minutes

Another function?

Page 22: Scraping in 60 minutes

Search for documentation!
https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/

Page 23: Scraping in 60 minutes

Query (XPath)

Page 24: Scraping in 60 minutes

XPath is a path through XML (or HTML)
<table> = //table
<table><tr> = //table//tr
<table><tr><td> = //table//tr//td
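
To see these paths at work outside a spreadsheet, here is a minimal sketch using Python's standard-library ElementTree, which supports a limited subset of XPath. The sample page is invented, and it has to be well-formed for this parser; real-world HTML usually needs a more forgiving library.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed sample page (made up for illustration).
page = """<html><body>
<table>
  <tr><td>Team A</td><td>3</td></tr>
  <tr><td>Team B</td><td>1</td></tr>
</table>
</body></html>"""

root = ET.fromstring(page)

# ElementTree paths are relative to an element, so //table is written .//table
tables = root.findall(".//table")
rows = root.findall(".//table//tr")
cells = root.findall(".//table//tr//td")

cell_texts = [cell.text for cell in cells]
print(len(tables), len(rows), len(cells))
print(cell_texts)
```

Each extra step in the path narrows the selection: one table, then its rows, then the cells inside those rows.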

Page 25: Scraping in 60 minutes

Search for documentation!
http://www.w3schools.com/XPath/xpath_syntax.asp

Page 26: Scraping in 60 minutes

Tip: search for structure around data

http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs

Page 29: Scraping in 60 minutes

"//div[@class='leftcolumn']"
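
The attribute test in square brackets can be tried on a small sample. This is a sketch with Python's standard-library ElementTree; the page content is invented, and only the class name `leftcolumn` comes from the slide.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample with two div elements.
page = """<html><body>
<div class="leftcolumn"><p>Job listings</p></div>
<div class="rightcolumn"><p>Adverts</p></div>
</body></html>"""

root = ET.fromstring(page)

# [@class='leftcolumn'] keeps only div elements whose class is exactly that value.
matches = root.findall(".//div[@class='leftcolumn']")
texts = [p.text for div in matches for p in div.findall("p")]

print(len(matches), texts)
```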

Page 30: Scraping in 60 minutes

//div[starts-with(@class, 'jobWrap')]
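
Python's standard-library ElementTree does not implement the XPath `starts-with()` function, so this sketch imitates the same test with the string method `startswith`. The sample page and the class suffixes are invented. A library with full XPath 1.0 support, such as lxml, can run the original expression directly.

```python
import xml.etree.ElementTree as ET

# Invented sample: two job blocks whose class begins with "jobWrap", plus a footer.
page = """<html><body>
<div class="jobWrap odd"><a>Researcher</a></div>
<div class="jobWrap even"><a>Press officer</a></div>
<div class="footer"><a>Contact</a></div>
</body></html>"""

root = ET.fromstring(page)

# Keep every div whose class attribute starts with "jobWrap".
job_divs = [
    div for div in root.iter("div")
    if div.get("class", "").startswith("jobWrap")
]
titles = [div.find("a").text for div in job_divs]

print(titles)
```

Matching on the start of the class is useful when the rest of the value varies from row to row.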

Page 31: Scraping in 60 minutes

A crib sheet:

Page 32: Scraping in 60 minutes

Chrome extension:

Page 37: Scraping in 60 minutes

#!/usr/bin/env python

import scraperwiki

# Variable (assigned with the = sign): html holds the page source returned by scrape()
html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')

# Statement used to show the variable
print(html)
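
If the `scraperwiki` library is not installed, a similar fetch can be sketched with Python 3's standard library. This is an assumption-laden stand-in, not the library's own code: the function below is hypothetical, it assumes the page is UTF-8, and the `opener` parameter exists only so the function can be tried without a network connection.

```python
from urllib.request import urlopen


def scrape(url, opener=urlopen):
    """Fetch a page and return its HTML as text (assumes UTF-8 encoding)."""
    with opener(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (needs a network connection, so it is left commented out here):
# html = scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
# print(html)
```

As in the slide, the result is stored in a variable with the = sign and then printed.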

Page 39: Scraping in 60 minutes

Jargon checklist:
Library
Shebang
List

Page 40: Scraping in 60 minutes

Paul Bradshaw
Leanpub.com/scrapingforjournalists

Thank you.