Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Crawling and Scraping: The Issuecrawler and the Lippmannian device. Michael Stevenson

Upload: digital-methods-initiative

Post on 17-Jun-2015


TRANSCRIPT

Page 1: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Crawling and Scraping

The Issuecrawler and the Lippmannian device.

Michael Stevenson

Page 2: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Issuecrawler. What does it do?

Page 3: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

CRAWL STARTING POINTS

Sites: A, B, C

Page 4: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

CRAWL STARTING POINTS

Sites: A, B, C

CRAWL DEPTH ONE: follow all starting points' outlinks

Sites: A, B, C, D

Page 5: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

CRAWL STARTING POINTS

Sites: A, B, C

CRAWL DEPTH ONE: follow all starting points' outlinks

Sites: A, B, C, D

CRAWL DEPTH TWO: follow all outlinks from the pages found in the previous depth

Sites: A, B, C, D, E, F, G, H
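The depth-by-depth expansion on these slides can be sketched as a breadth-first walk over a link graph. This is a minimal illustration, not the Issuecrawler's actual implementation: the `TOY_LINKS` graph and the `crawl` function are hypothetical, and the link structure is invented so that the sites found at each depth match the labels on the slides.

```python
from typing import Dict, List, Set

# Hypothetical toy web: site -> sites it links to (invented for illustration).
TOY_LINKS: Dict[str, List[str]] = {
    "A": ["B", "D"],
    "B": ["A"],
    "C": ["B", "D"],
    "D": ["E", "F", "G", "H"],
}


def crawl(links: Dict[str, List[str]],
          starting_points: List[str],
          depth: int) -> List[Set[str]]:
    """Return the set of known sites after each depth (index 0 = starting points)."""
    known: Set[str] = set(starting_points)
    frontier: Set[str] = set(starting_points)
    snapshots: List[Set[str]] = [set(known)]
    for _ in range(depth):
        found: Set[str] = set()
        for site in frontier:
            # Follow every outlink of the sites found in the previous depth.
            found.update(links.get(site, []))
        frontier = found - known  # only newly discovered sites are crawled next
        known |= frontier
        snapshots.append(set(known))
    return snapshots


if __name__ == "__main__":
    for d, sites in enumerate(crawl(TOY_LINKS, ["A", "B", "C"], 2)):
        print(f"depth {d}: {sorted(sites)}")
```

With this toy graph, depth one adds site D and depth two adds sites E through H, as on the slides.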

Page 6: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

ANALYSIS SNOWBALL: retain all links and sites discovered during the crawl

Sites: A, B, C, D, E, F, G, H

Page 7: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

ANALYSIS INTER-ACTOR: retain only links between the starting points

Sites: A, B, C

Page 8: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

ANALYSIS CO-LINK: retain sites that receive links from at least two other sites

Sites: B, D
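The three analysis modes can be compared side by side on one small edge list. The sketch below follows only the one-line definitions given on the slides; the real Issuecrawler may apply further rules that the slides do not show. The `EDGES` list and the function names are hypothetical, and the edges are invented so that the retained sites match the slides.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (linking site, linked-to site)

# Hypothetical crawl result, invented for illustration.
EDGES: List[Edge] = [
    ("A", "B"), ("A", "D"), ("B", "A"), ("C", "B"), ("C", "D"),
    ("D", "E"), ("D", "F"), ("D", "G"), ("D", "H"),
]
STARTING_POINTS: Set[str] = {"A", "B", "C"}


def snowball(edges: List[Edge]) -> List[str]:
    """Retain every site discovered during the crawl."""
    return sorted({site for edge in edges for site in edge})


def inter_actor(edges: List[Edge], starting_points: Set[str]) -> List[Edge]:
    """Retain only links whose source and target are both starting points."""
    return [(s, t) for s, t in edges if s in starting_points and t in starting_points]


def co_link(edges: List[Edge], min_inlinks: int = 2) -> List[str]:
    """Retain sites that receive links from at least `min_inlinks` other sites."""
    inlinkers: Dict[str, Set[str]] = {}
    for source, target in edges:
        if source != target:  # a site linking to itself does not count
            inlinkers.setdefault(target, set()).add(source)
    return sorted(t for t, sources in inlinkers.items() if len(sources) >= min_inlinks)


if __name__ == "__main__":
    print("snowball:   ", snowball(EDGES))
    print("inter-actor:", inter_actor(EDGES, STARTING_POINTS))
    print("co-link:    ", co_link(EDGES))
```

In this toy data only B and D receive links from two different sites, so the co-link analysis keeps those two and drops the rest.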

Page 9: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Climate change blogs network

Starting points: blogroll from RealClimate.org

Page 10: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Climate change blogs network

Results: mix of blogs, social media, traditional media and governmental and non-governmental organizations.

Page 11: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Climate change science network

Starting points: “science links” from RealClimate.org

Results: mix of governmental, non-governmental, educational and media organizations

Page 12: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

OK... We have the issue networks, but what can we say about their content?

Page 13: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device (aka the Google Scraper)

Page 14: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

What does it do?

1. Explore a source’s partisanship or commitment.

2. Show the issue agenda of an organization or movement.

Source cloud Issue cloud

Partisanship or commitment. Which sources mention the expert’s name?

Issue agenda. Which issues are on the agenda of an organization or movement?

Page 15: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Source cloud”

Showing the partisanship or commitment of sources to one name

Craig Venter's presence in the Synthetic Biology issue space, March 2008. Top sources on "synthetic biology" according to a Google query, with number of mentions of Venter per source, ordered.

Page 16: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Source cloud”

Method for showing the partisanship or commitment of sources to names

1. Gather source list (e.g. through Issuecrawler)

2. Query source list for one or more experts
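The two steps above amount to pairing every source with every name. The sketch below only builds the query strings and does not contact any search engine. It assumes the scraper issues site-restricted queries of the form `site:domain "name"`; the slides do not spell out the query syntax, and `build_queries` is a hypothetical helper.

```python
from typing import List, Tuple


def build_queries(sources: List[str], names: List[str]) -> List[Tuple[str, str, str]]:
    """Return one (source, name, query) triple per source/name pair."""
    queries: List[Tuple[str, str, str]] = []
    for source in sources:
        for name in names:
            # Quote the name so only exact matches are counted.
            exact = '"' + name.strip().strip('"') + '"'
            # Assumed query form: restrict the search to a single source.
            queries.append((source, name, f"site:{source} {exact}"))
    return queries


if __name__ == "__main__":
    for _, _, query in build_queries(["realclimate.org", "sourcewatch.org"],
                                     ["Frederick Seitz"]):
        print(query)
```

Each query yields one count per source and name; those counts are what the source cloud displays.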

Page 17: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Source cloud”

Showing the partisanship or commitment of sources to names

Climate Change Skeptics: Who recognizes them?

(Digital Methods Initiative, 2007)

https://wiki.digitalmethods.net/Dmi/ClimateChangeSkeptics

Page 18: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Making an Issue cloud”

An organization’s issue agenda (or commitment)

Public Knowledge, a digital rights NGO, has issues. Which are they most committed to?

Page 19: Crawling and Scraping tutorial at the Digital Methods Summer School 2013
Page 20: Crawling and Scraping tutorial at the Digital Methods Summer School 2013
Page 21: Crawling and Scraping tutorial at the Digital Methods Summer School 2013
Page 22: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Issue cloud”

Showing the issue commitments of the NGO, Public Knowledge

Public Knowledge's issue commitment. Lower six issues on Public Knowledge's issue list, ranked according to number of mentions of issues on publicknowledge.org, 2 October 2009.

Page 23: Crawling and Scraping tutorial at the Digital Methods Summer School 2013
Page 24: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Making an Issue cloud”

Greenpeace issues, http://www.greenpeace.org/international/campaigns.

Stop climate change
Protect ancient forests
Defending our Oceans
Say no to genetic engineering
Eliminate toxic chemicals
Demand Peace and Disarmament
End the nuclear age
Encourage sustainable trade

Keep most significant issue language.

"climate change"
"ancient forests"
oceans
"genetic engineering"
"toxic chemicals"
disarmament
"nuclear power"
"sustainable trade"

Page 25: Crawling and Scraping tutorial at the Digital Methods Summer School 2013
Page 26: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Issue cloud”

Greenpeace’s issue agenda (distribution of commitment)

Greenpeace's issue commitment. Greenpeace's campaign issue list, ranked according to number of mentions of issues on greenpeace.org, 11 October 2009.

Page 27: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Making an Issue cloud”

Multiple sources, multiple issues

What is the agenda of the global human rights network?

Which issues are at the top and at the bottom of the agenda?

What is the current level of commitment to a particular issue?

Page 28: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Making an Issue cloud”

Multiple sources, multiple issues

This is more complicated, but still doable

(Govcom.org, University of Pittsburgh, UMass Amherst, ongoing)

Page 29: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Making an Issue cloud”

Take three good lists of human rights organizations (global south, global north, UN’s)

Page 30: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Making an Issue cloud”

Make a list of all issues listed on all Websites

Page 31: Crawling and Scraping tutorial at the Digital Methods Summer School 2013
Page 32: Crawling and Scraping tutorial at the Digital Methods Summer School 2013
Page 33: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Issue cloud”

Showing the issue commitments of global human rights network

Global human rights issue agenda. Global human rights actors' issues, ranked according to the estimated number of Google mentions on a set of global human rights actors' websites, 31 March 2009.

Page 34: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device. “Issue cloud”

Showing the issue commitments of global human rights network

Global human rights issue agenda, bottom. Global human rights actors' issues, ranked according to the estimated number of Google mentions on a set of global human rights actors' websites, 31 March 2009.

Page 35: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device.

Partisanship check. Which side of the controversy is an actor on?

Use the source cloud

Page 36: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Lippmannian device.

1. Check an organization’s issue agenda. What are its current commitments?

2. Check a national or global movement’s issue agenda. What are its current commitments?

Use the issue cloud

Page 37: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Questions.

Page 38: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Exercise: Sourcing Climate Change Skeptics.

Page 39: Crawling and Scraping tutorial at the Digital Methods Summer School 2013


Climate Change Sceptics on the Web (Frederick Seitz)

Research Question: To what extent are climate change 'skeptics' present in the climate change spaces on the Web?

Findings: There is distance between the skeptics and the top of the search engine returns.

Source: google.com

Query: “Frederick Seitz”

Method: Search for query “Frederick Seitz” in top 100. Organized in order.

Tools: Google Scraper and Tag Cloud Generator

Date: 30 July 2007

Product_of the Digital Methods Initiative, dmi.mediastudies.nl. Analysis_by Bram Nijhof, Richard Rogers and Laura van der Vlies. Design_Anne Helmond.

CC_BY:NC:SA

campaigncc.org (1)

climateark.org (4)

marshall.org (8)

realclimate.org (35)

sourcewatch.org (21)

abc.net.au (0)

acfonline.org.au (0)

bbc.co.uk (0)

bom.gov.au (0)

cbc.ca (0)

ciel.org (0)

climatechallenge.gov.uk (0)

climatechange.ca.gov (0)

climatechange.com.au (0)

climatechangecentral.com (0)

climatechangecollege.org (0)

climatecrisis.net (0)

climatescience.gov (0)

dar.csiro.au (0)

davidsuzuki.org (0)

defra.gov.uk (0)

dfat.gov.au (0)

ec.gc.ca (0)

ecn.ac.uk (0)

ecokids.ca (0)

ecy.wa.gov (0)

eea.europa.eu (0)

eldis.org (0)

energy.gov (0)

envirolink.org (0)

epa.gov (0)

exploratorium.edu (0)

faqs.org (0)

foe.co.uk (0)

ft.com (0)

g8.gov.uk (0)

gcrio.org (0)

greenpeace.org (0)

grida.no (0)

guardian.co.uk (0)

iea.org (0)

iisd.org (0)

ipcc.ch (0)

iucn.org (0)

ltscotland.org.uk (0)

metoffice.gov.uk (0)

mfe.govt.nz (0)

mofa.go.jp (0)

nature.com (0)

nature.org (0)

ncdc.noaa.gov (0)

open2.net (0)

panda.org (0)

pewclimate.org (0)

royalsoc.ac.uk (0)

scidev.net (0)

scienceagogo.com (0)

state.gov (0)

theglobeandmail.com (0)

ucar.edu (0)

un.org (0)

unep.org (0)

who.int (0)

whoi.edu (0)

worldwildlife.org (0)

CLIMATE CHANGE SCEPTICS
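A source cloud like this one can be approximated by ranking sources on their mention counts and scaling a font size to each count. The sketch below uses the five non-zero counts from the slide plus two of the zero-count sources. The linear scaling and the `cloud_sizes` helper are assumptions for illustration, not the Tag Cloud Generator's actual sizing rule.

```python
from typing import Dict, List, Tuple

# Mention counts for "Frederick Seitz" as listed on the slide (excerpt).
SEITZ_MENTIONS: Dict[str, int] = {
    "campaigncc.org": 1,
    "climateark.org": 4,
    "marshall.org": 8,
    "realclimate.org": 35,
    "sourcewatch.org": 21,
    "epa.gov": 0,
    "ipcc.ch": 0,
}


def cloud_sizes(counts: Dict[str, int],
                min_pt: int = 10,
                max_pt: int = 36) -> List[Tuple[str, int, int]]:
    """Rank sources by mentions (descending) and map counts linearly to font sizes."""
    top = max(counts.values(), default=0)
    # Ties are broken alphabetically so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    sized: List[Tuple[str, int, int]] = []
    for source, mentions in ranked:
        if top == 0:
            size = min_pt  # nobody mentions the name: everything stays small
        else:
            size = round(min_pt + (max_pt - min_pt) * mentions / top)
        sized.append((source, mentions, size))
    return sized


if __name__ == "__main__":
    for source, mentions, size in cloud_sizes(SEITZ_MENTIONS):
        print(f"{source} ({mentions}) -> {size}pt")
```

Sources with zero mentions stay at the minimum size, which makes the gap between the few sources that mention Seitz and the many that do not easy to see.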

Page 40: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Research Question: Which climate change issue actors mention the skeptics, and what kinds of actors are more likely to mention them?

Method: Comparative Query: skeptics in three source sets (‘top’ sources, climate change blogs and climate change science network), outputting source cloud for each.

Page 41: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Source Sets:

(1) Top ten Google returns for “climate change” (mix of media as well as governmental organizations)

Page 42: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Source Sets:

(2) Climate change blogs network (IssueCrawler results - mix of blogs, social media, traditional media and governmental and non-governmental organizations)

Page 43: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Source Sets:

(3) Climate change science network (IssueCrawler results - governmental, non-governmental, educational and media organizations)

Page 44: Crawling and Scraping tutorial at the Digital Methods Summer School 2013

Steps:

- Install the DMI toolbar, and open the Lippmannian device (aka Google Scraper - see tools.digitalmethods.net).

- Acquire source sets and skeptics list.

- Enter source sets and skeptics' names. Query the source sets separately, and remember to use “” to get exact returns.

- Wait, fill in CAPTCHAs if necessary. Also use this moment to discuss hypotheses.

- Explore the output, and present findings.
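When exploring the output, one way to compare the three source clouds is to reduce each to a few summary figures, for example how many sources in a set mention a skeptic at all. The sketch below is an illustration under stated assumptions: `summarize` is a hypothetical helper, and the counts in `PLACEHOLDER_RESULTS` are invented placeholders, not findings from the exercise.

```python
from typing import Dict

# Placeholder counts, invented for illustration only; replace them with the
# per-source mention counts returned for each source set.
PLACEHOLDER_RESULTS: Dict[str, Dict[str, int]] = {
    "top sources": {"top-1.example": 0, "top-2.example": 0},
    "blogs network": {"blog-1.example": 3, "blog-2.example": 0},
    "science network": {"science-1.example": 0, "science-2.example": 1},
}


def summarize(cloud: Dict[str, int]) -> Dict[str, float]:
    """Summarize one source cloud: size, sources with mentions, share, total mentions."""
    total = len(cloud)
    mentioning = sum(1 for count in cloud.values() if count > 0)
    return {
        "sources": total,
        "mentioning": mentioning,
        "share": mentioning / total if total else 0.0,
        "mentions": sum(cloud.values()),
    }


if __name__ == "__main__":
    for set_name, cloud in PLACEHOLDER_RESULTS.items():
        print(set_name, summarize(cloud))
```

Comparing the share of mentioning sources across sets speaks to the second half of the research question, namely which kinds of actors are more likely to mention the skeptics.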