scraping in 60 minutes
DESCRIPTION
Presentation at Data Harvest 2014
TRANSCRIPT
Paul Bradshaw
Leanpub.com/scrapingforjournalists
Scraping in 60 mins
Saturday, 10 May 14
https://www.youtube.com/watch?v=Efr-VEkwWoM
Function (Arguments) (aka parameters)
Function (arguments)
=SUM(A2:A50)
=AVERAGE(B2:B300)
=COUNTIF(A10:A3000,"Smith")
Function (parameters)
=SUM(range of cells to be summed)
=AVERAGE(range of cells to be averaged)
=COUNTIF(range of cells to be counted, what to count)
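The same function(arguments) pattern carries over from spreadsheets to Python. What follows is a minimal sketch, assuming made-up sample values in place of spreadsheet ranges; `count_if` is a hypothetical helper written here to mirror COUNTIF, not a Python built-in.

```python
from statistics import mean

# Made-up sample data standing in for spreadsheet ranges.
amounts = [10, 20, 30]
surnames = ["Smith", "Jones", "Smith"]


def count_if(values, target):
    """Hypothetical COUNTIF equivalent: count the items equal to target."""
    return sum(1 for value in values if value == target)


total = sum(amounts)                  # like =SUM(range)
average = mean(amounts)               # like =AVERAGE(range)
smiths = count_if(surnames, "Smith")  # like =COUNTIF(range, "Smith")

print(total, average, smiths)
```

In each call the name before the brackets is the function and whatever sits inside the brackets is its arguments, exactly as in the spreadsheet formulas.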
("string", index)
Tip: search for documentation
Variable
Variables
Jargon checklist:
Function
Arguments
Parameters
String
Index
Variable
Documentation
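Three of the checklist terms (variable, string, index) can be tried together in a few lines of Python. This is only an illustrative sketch: the slides do not say which function takes ("string", index), so `str.find` is used here purely as one example of a call with that shape, and the sample value is made up.

```python
# A variable is a named container; here it holds a string (text in quotes).
surname = "Smith"

# An index is a position number. Python counts from 0,
# so index 0 picks out the first character.
first_letter = surname[0]

# A function call with two arguments: a string to look for
# and an index to start looking from.
position = surname.find("i", 0)

print(first_letter, position)
```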
Vote: =importXML or Python?
Another function?
Search for documentation!
https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
Query (XPath)
XPath is a path through XML (or HTML)
<table> = //table
<table><tr> = //table//tr
<table><tr><td> = //table//tr//td
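The three paths above can be tried outside a spreadsheet too. Below is a minimal sketch using Python's standard-library ElementTree, which understands a limited subset of XPath; the sample table is made up, and real web pages are often not well-formed enough for this parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup standing in for a real page.
page = """<html><body>
  <table>
    <tr><td>Reporter</td><td>Brussels</td></tr>
    <tr><td>Editor</td><td>London</td></tr>
  </table>
</body></html>"""

root = ET.fromstring(page)

# ElementTree wants a leading dot: ".//table" plays the role of "//table".
tables = root.findall(".//table")
rows = root.findall(".//table//tr")
cells = root.findall(".//table//tr//td")

print(len(tables), len(rows), len(cells))
print([cell.text for cell in cells])
```

Each extra step in the path narrows the search: every table, then every row inside a table, then every cell inside those rows.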
Search for documentation!
http://www.w3schools.com/XPath/xpath_syntax.asp
Tip: search for structure around data
http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
http://www.w4mpjobs.org/SearchJobs.aspx?
search=alljobs
http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs
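The split shown above, with the page address before the question mark and the query after it, can be reproduced with Python's standard library. This is a small sketch rather than part of the original talk; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.w4mpjobs.org/SearchJobs.aspx?search=alljobs"

parts = urlsplit(url)

# Everything before the "?": the address of the search page itself.
address = parts.scheme + "://" + parts.netloc + parts.path

# Everything after the "?": name=value pairs sent to that page.
query = parse_qs(parts.query)

print(address)
print(query)
```

Seeing the query as name=value pairs makes it easier to spot which part of a URL to change when scraping several result pages.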
"//div[@class= 'leftcolumn']"
//div[starts-with(@class, 'jobWrap')]
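Both kinds of class query can be approximated in Python's standard library. The sketch below runs on made-up markup: the class names echo the slides, but the structure and text are invented. ElementTree handles the exact match directly; it does not support starts-with(), so that query is swapped for a plain Python filter here (a full XPath engine such as the third-party lxml library would accept the expression as written).

```python
import xml.etree.ElementTree as ET

# Made-up sample markup; only the class names come from the slides.
page = """<html><body>
  <div class="leftcolumn">
    <div class="jobWrap odd">Job one</div>
    <div class="jobWrap even">Job two</div>
    <div class="footer">Not a job</div>
  </div>
</body></html>"""

root = ET.fromstring(page)

# Exact attribute match: supported directly by ElementTree.
left_columns = root.findall(".//div[@class='leftcolumn']")

# starts-with() is not available in ElementTree, so filter in Python instead.
jobs = [
    div for div in root.iter("div")
    if div.get("class", "").startswith("jobWrap")
]

print(len(left_columns))
print([job.text for job in jobs])
```

Matching on the start of the class is useful when a site appends extra words to it, as with the invented "odd" and "even" suffixes here.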
A crib sheet:
Chrome extension:
#!/usr/bin/env python
import scraperwiki
Line 1 (Shebang): 'This is a Python script'
Line 2: import the Scraperwiki library
#!/usr/bin/env python
import scraperwiki
html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
print html
Function(argument)
#!/usr/bin/env python
import scraperwiki
html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
print html
Comes from Scraperwiki library (check documentation)
#!/usr/bin/env python
import scraperwiki
html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
print html
Variable (assigned with = sign)
Statement used to show variable
#!/usr/bin/env python
import scraperwiki
html = scraperwiki.scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')
print html
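For readers without the Scraperwiki library, roughly the same job can be done with Python 3's standard library. This is a hedged sketch, not the library's own implementation: the `scrape` function below is a hypothetical stand-in written for illustration, and falling back to UTF-8 when the server names no encoding is an assumption.

```python
from urllib.request import urlopen


def scrape(url):
    """Fetch a URL and return its body as text.

    Hypothetical stand-in for scraperwiki.scrape, for illustration only.
    """
    with urlopen(url) as response:
        # Use the encoding the server declares; assume UTF-8 if it gives none.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

It would be used the same way as in the slides: `html = scrape('http://uk.soccerway.com/teams/netherlands/fortuna-sittard/1551/')` followed by `print(html)`. Whether that page still responds as it did in 2014 is not something this sketch checks.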
Jargon checklist:
Library
Shebang
List
Paul Bradshaw
Leanpub.com/scrapingforjournalists
Thank you.