Motivation and definitions / Controversies
Good practices / Introduction to software
Homework
Web scraping and social media scraping – introduction
Jacek Lewkowicz, Dorota Celinska-Kopczynska
University of Warsaw
March 18, 2019
Motivation
Tons of (potentially useful) information on the Web
Instead of manually gathering the needed information, we employ computer programs to do it automatically
We have to understand the structure of the webpage and find a general scheme for data collection
Web scraping may be useful and efficient here
Web scraping
The process of taking unstructured information from the Web
and processing it into structured data for later analysis
Commonly used not only by researchers: web crawlers/web spiders/search bots used by search engines
To scrape the web (= drag; scraping), not to scrap (= discard; scrapping)
Bots crawlers spiders scrapers
A bot is a software application that runs automated, usually repetitive tasks (scripts) over the Internet
A crawler or spider is a bot which systematically browses the Internet, typically for the purpose of Web indexing. In other words, a crawler surfs the web following links – it does not have to extract data
A scraper takes downloaded pages and attempts to extract data from them
However, there are no precise definitions
Example areas of usage
Download results of general elections
Download demo MP3
Download legal decisions
Download comments and users' activity from social media
Alternatives to traditional scraping
ScraperWiki
Google Refine
Outwit
SocSciBot
(arguably) APIs, already-scraped collections of data available online (e.g., GHTorrent), or API mirrors
Legal issues and famous cases / Preventing bots
Is it legal
Technically, usually it is. However, there are situations in which you are not allowed to gather the data from a given website
Look for disclaimers and Terms of (Fair) Use
Scraping may mean copyright infringement
Data privacy, sensitive information
Web crawling – a difficult matter
Controversial cases
eBay vs Bidder's Edge
One company crawls information from another
Facebook vs Meltwater
Prior written permissions needed
US vs Aaron Swartz
A.S. was arrested in 2011 for having illegally downloaded millions of articles from the article archive JSTOR. The case was dismissed after Swartz's suicide in January 2013
GHTorrent issue #32 incident
http://speakerdeck.com/player/1c64fd1e7dfe4032aff246b2dd1195bf
Web scraping for social good or evil
People tend to be extremely paranoid about their data privacy while at the same time being extremely careless about it
The same people who cover the cameras in their laptops do not care much about the photos they upload to Facebook/other social media or about the apps they are using
Celebgate: http://en.wikipedia.org/wiki/ICloud_leaks_of_celebrity_photos
AI gone too far: http://www.telegraph.co.uk/technology/2017/09/08/ai-can-tell-people-gay-straight-one-photo-face
Counteracting scraping – robots.txt
Many web crawlers are hunting for content
Web robots keep some indices up to date
The robots.txt file and other methods are used to keep the server traffic in check

User-agent: *
Disallow: /private/
robots.txt – limitations
Considered an outdated standard due to its limitations
A subsite can still be indexed if another site redirects to it
Even if a subsite is not listed in robots.txt, it does not mean the owner allows scraping it
Imagine the relation between a host and a guest: "OK, feel at home, but do not take the money I put in the jar on the third shelf in the kitchen cupboard." The guest should obey, but will they? The same goes for robots.txt, which discloses information that should stay hidden, e.g. links to the admin's login pages
Read http://support.google.com/webmasters/answer/6062608
Counteracting scraping ndash ltmetagt tags
A modern approach to blocking bots from accessing part of a website is the use of <meta> tags
<meta name="robots" content="noindex">
Read https://support.google.com/webmasters/answer/93710
What makes a crawler polite
A polite crawler respects robotstxt
A polite crawler never degrades a website's performance
A polite crawler identifies its creator with contact information
A polite crawler does not drive system administrators crazy
Responsible crawling 1 ndash Terms of Use
Make sure your crawler follows the rules contained in the Terms of (Fair) Use
Do check the robots.txt file
(we know you have no time) – you may write a program which parses robots.txt and identifies particular user-agents and the corresponding access rules
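Such a parser already ships with Python's standard library; a minimal sketch (the rules and the user-agent name below are illustrative examples, not from any real site):

```python
# Parsing robots.txt with the standard-library urllib.robotparser module.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Rules can also be fetched over the network with rp.set_url(...) and
# rp.read(); here they are fed in directly to keep the sketch self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Ask whether a given user-agent may fetch a given URL
print(rp.can_fetch("MyCrawler", "http://example.com/private/data.html"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))         # True
```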
Responsible crawling 2 ndash Identify yourself
Include your company name and an email address or website in the request's User-Agent header
For example, Google's crawler user agent is "Googlebot"
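With the Requests library this amounts to one header (a sketch; the company name, URL, and e-mail below are hypothetical placeholders):

```python
import requests

# The User-Agent string identifies the crawler and its operator;
# the company name, URL, and e-mail here are hypothetical placeholders.
headers = {
    "User-Agent": "MyCompany-MyCrawler/1.0 (+http://mycompany.com/bot; bot@mycompany.com)"
}

# Prepare (without sending) a request carrying the identifying header
request = requests.Request("GET", "http://example.com", headers=headers).prepare()
print(request.headers["User-Agent"])
```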
Responsible crawling 3 ndash Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive
If you do not, the website will make you:

2012-12-21 05:04:36+0800 [working] DEBUG: Retrying
<GET http://www.example.com/profile.php?id=1580> (failed 1 times): 503 Service Unavailable
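The standard-library robots.txt parser can read the Crawl-delay directive, so honouring it is a matter of sleeping between requests (a sketch; the delay value and URLs are illustrative):

```python
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Illustrative rules; in practice they would be fetched from the site
rp.parse([
    "User-agent: *",
    "Crawl-delay: 1",
])

# crawl_delay() returns None when the directive is absent
delay = rp.crawl_delay("MyCrawler") or 1.0

for url in ["http://example.com/a", "http://example.com/b"]:
    # ... download and process url here ...
    time.sleep(delay)  # wait before hitting the server again
```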
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data you gathered
Consider which parts of the analyses and/or data sets can be displayed publicly
Respect the sensitivity of the information
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robots.txt as a list of instructions to bots/scrapers; Terms of (Fair) Use – which elements of the website you should not scrape
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible, use APIs, API mirrors, or collections of data already scraped for you
Requests library / Beautiful Soup / Scrapy
Requests library
Python core libraries may be enough, but there are more appropriate packages
The Requests library is suitable for complicated HTTP requests: cookies, headers, and other issues
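For instance, a Session object carries cookies and headers across requests and reuses connections (a sketch; the user-agent string and cookie are made-up examples):

```python
import requests

# A Session reuses connections and carries cookies/headers automatically
session = requests.Session()
session.headers.update({"User-Agent": "MyCrawler/0.1 (bot@example.org)"})  # hypothetical UA
session.cookies.set("language", "en")  # sent with every subsequent request

# An actual fetch would look like this (commented out to keep the sketch offline):
# response = session.get("http://example.com/page", timeout=10)
# print(response.status_code, response.headers.get("Content-Type"))

print(session.headers["User-Agent"])
print(session.cookies.get("language"))
```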
Beautiful what
Beautiful Soup is a Python library
named after Lewis Carroll's poem from Alice's Adventures in Wonderland
The tool is for making sense of the nonsensical
Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Installing Beautiful Soup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
http://pypi.python.org/pypi/setuptools
Think about setting up Python virtual environments
Unix-like (Mac and GNU/Linux):
sudo easy_install pip
pip install beautifulsoup4
Other environments:
python setup.py install
pip install beautifulsoup4
Problems after installation
The authors of the library suggest that most known problems are related to Python 2 vs Python 3 issues
Typical way of working
You work with a file containing Python code and import the necessary libraries
It requires some programming skills
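A typical minimal session looks like this (a sketch; the HTML fragment is made up to stand in for a downloaded page):

```python
from bs4 import BeautifulSoup

# An illustrative HTML fragment standing in for a downloaded page
html = """
<html><body>
  <h1>Election results</h1>
  <a href="/region/1">Region 1</a>
  <a href="/region/2">Region 2</a>
</body></html>
"""

# Parse the markup and extract text and links
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())                        # Election results
print([a["href"] for a in soup.find_all("a")])   # ['/region/1', '/region/2']
```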
What is Scrapy
Scrapy is a framework, which means a lot of issues related to scraping have already been solved for you
and you may use "automagical" tools to deal with them (with all the blessings and curses this brings)
Requires minimal knowledge and skills in Python and programming
Handles a lot of the complexity of finding & evaluating links on a website, and of scraping whole domains or lists of domains
Scrapy ndash installation
http://scrapy.org
http://doc.scrapy.org/en/latest/intro/install.html
http://pypi.python.org/pypi/setuptools
Think about setting up Python virtual environments (recommended)
Unix-like (Mac and GNU/Linux):
sudo easy_install pip
pip install Scrapy
Other environments:
Read the manual; the recommended way is to use Anaconda
Scrapy ndash dependencies
If you encounter problems during the installation, you should probably check the dependencies, e.g.:
lxml – an efficient XML and HTML parser
w3lib – a multi-purpose helper for dealing with URLs and web page encodings
cryptography and pyOpenSSL – to deal with various network-level security needs
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scraper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually happens within configuration files (settings.py)
Scrapy ndash responsible scraping
ROBOTSTXT_OBEY = True
USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'
DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN

2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
Homework
Install the libraries and frameworks on your hardware
Read the "Introduction" and "Creating a project" sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
- Motivation and definitions
- Controversies
  - Legal issues and famous cases
  - Preventing bots
- Good practices
- Introduction to software
  - Requests library
  - Beautiful Soup
  - Scrapy
- Homework
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Alternatives to traditional scraping
ScraperWiki
Google Refine
Outwit
SocSciBot
(arguably) APIs, already-scraped collections of data available online (e.g. GHTorrent), or API mirrors
Legal issues and famous cases / Preventing bots
Is it legal?
Technically, usually it is. However, there are situations in which you are not allowed to gather the data from a given website.
Look for disclaimers and Terms of (Fair) Use.
Scraping may mean copyright infringement.
Data privacy: sensitive information.
Web crawling remains a legally difficult matter.
Controversial cases
eBay vs Bidder's Edge
One company crawls information from another.
Facebook vs Meltwater
Prior written permission needed.
US vs Aaron Swartz
Swartz was arrested in 2011 for illegally downloading millions of articles from the JSTOR article archive. The case was dismissed after Swartz's suicide in January 2013.
GHTorrent issue #32 incident
https://speakerdeck.com/player/1c64fd1e7dfe4032aff246b2dd1195bf
Web scraping for social good or evil?
People tend to be extremely paranoid about their data privacy while at the same time being extremely careless about it.
The same people who cover the cameras in their laptops may not care about the photos they upload to Facebook and other social media, or about the apps they use.
Celebgate: http://en.wikipedia.org/wiki/ICloud_leaks_of_celebrity_photos
AI gone too far? http://www.telegraph.co.uk/technology/2017/09/08/ai-can-tell-people-gay-straight-one-photo-face/
Counteracting scraping – robots.txt
Many web crawlers are hunting for content.
Web robots keep search indices up to date.
A robots.txt file, among other methods, is used to keep the server traffic in check:
User-agent: *
Disallow: /private/
robots.txt – limitations
Considered an outdated standard due to its limitations.
A subpage can still be indexed if another site redirects to it.
Even if a subpage is not listed in robots.txt, it does not mean the owner allows you to scrape it.
Imagine a host telling a guest: "OK, feel at home, but do not take the money I put in the jar on the third shelf of the kitchen cupboard." The guest should obey, but will they? The same goes for robots.txt, which may disclose information that should stay hidden, e.g. links to the admin's login pages.
Read: http://support.google.com/webmasters/answer/6062608
Counteracting scraping – <meta> tags
A more modern approach to blocking bots from accessing part of a website is the use of <meta> tags:
<meta name="robots" content="noindex">
Read: https://support.google.com/webmasters/answer/93710
What makes a crawler polite?
A polite crawler respects robots.txt.
A polite crawler never degrades a website's performance.
A polite crawler identifies its creator and provides contact information.
A polite crawler does not drive system administrators crazy.
Responsible crawling 1 – Terms of Use
Make sure your crawler follows the rules contained in the Terms of (Fair) Use.
Do check the robots.txt file.
If you are pressed for time, you can write a program that parses robots.txt and identifies the access rules that apply to particular user agents.
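The standard library can do this parsing for you. A minimal sketch; the robots.txt content and crawler name below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt supplied inline; in practice you would call
# rp.set_url("http://example.com/robots.txt") followed by rp.read().
robots_txt = """
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a given user agent may fetch a given URL.
print(rp.can_fetch("MyCrawler", "http://example.com/private/secret.html"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))           # True
print(rp.crawl_delay("MyCrawler"))                                          # 5
```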
Responsible crawling 2 – Identify yourself
Include your company name and an email address or website in the request's User-Agent header.
For example, Google's crawler user agent is "Googlebot".
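With the standard library this takes one header; the crawler name and contact address below are hypothetical placeholders:

```python
from urllib.request import Request

# Identify the crawler and its operator in the User-Agent header.
req = Request(
    "http://example.com/",
    headers={"User-Agent": "MyCompany-MyCrawler (bot@mycompany.com)"},
)

print(req.get_header("User-agent"))  # MyCompany-MyCrawler (bot@mycompany.com)
# The request is only sent once `req` is passed to urllib.request.urlopen().
```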
Responsible crawling 3 – Crawl delay
Make sure your crawler does not hit a website too hard.
Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive.
If you do not, the server will make you wait anyway:
2012-12-21 05:04:36+0800 [working] DEBUG: Retrying
<GET http://www.example.com/profile.php?id=1580> (failed 1
times): 503 Service Unavailable
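A minimal sketch of honoring a delay between consecutive requests; the `fetch` callable is a stand-in for a real downloader:

```python
import time

def polite_fetch_all(urls, fetch, delay=1.0):
    """Call `fetch` on each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # respect the crawl delay between consecutive hits
        results.append(fetch(url))
    return results

# Stand-in downloader for demonstration; a real one would issue an HTTP GET.
pages = polite_fetch_all(["/a", "/b", "/c"], fetch=lambda u: f"contents of {u}", delay=0.1)
print(pages)  # ['contents of /a', 'contents of /b', 'contents of /c']
```

In practice the `delay` argument would come from the Crawl-Delay value parsed out of robots.txt.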
Responsible crawling 4 – Take care of your data
Pay attention to what happens with the data you have gathered.
Consider which parts of the analyses and/or data sets can be displayed publicly.
Respect the sensitivity of the information.
Good practices & ethics in a nutshell
Respect the hosting site's wishes:
robots.txt as a list of instructions to bots/scrapers; Terms of (Fair) Use state which elements of the website you should not scrape.
Respect the hosting site's bandwidth:
scraping may be costly or result in the site going down.
Respect the law:
terms & agreements.
Take care of the data:
consider which information can or cannot be displayed publicly.
If possible, use APIs, API mirrors, or collections of data already scraped for you.
Requests library / Beautiful Soup / Scrapy
Requests library
Python's core libraries may be enough, but there are more appropriate packages.
The Requests library is suitable for complicated HTTP requests: cookies, headers, and other issues.
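A small sketch of the kind of call Requests makes convenient; the URL, header, and cookie values are illustrative, and the third-party `requests` package must be installed:

```python
import requests

def fetch_page(url):
    """Issue a GET request with custom headers and cookies attached."""
    response = requests.get(
        url,
        headers={"User-Agent": "MyCompany-MyCrawler (bot@mycompany.com)"},
        cookies={"session": "example-session-id"},
        timeout=10,
    )
    response.raise_for_status()  # raise an exception on 4xx/5xx answers
    return response.text

# Example (performs a real HTTP request, so it is left commented out):
# html = fetch_page("http://example.com/")
```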
Beautiful what?
Beautiful Soup is a Python library.
The name comes from Lewis Carroll's poem in Alice's Adventures in Wonderland.
The tool is for making sense of the nonsensical:
"Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!"
Installing Beautiful Soup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
http://pypi.python.org/pypi/setuptools
Think about setting up Python virtual environments.
Unix-like (Mac and GNU/Linux):
sudo easy_install pip
pip install beautifulsoup4
Other environments:
python setup.py install
pip install beautifulsoup4
Problems after installation
The library's authors suggest that most known problems are related to Python 2 vs Python 3 issues.
Typical way of working
You work with a file containing Python code and import the necessary libraries.
It requires some programming skills.
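A minimal parsing sketch; the HTML fragment is made up, and the `beautifulsoup4` package installed above is required:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page.
html = """
<html><head><title>Election results</title></head>
<body>
  <p class="intro">Turnout was high.</p>
  <a href="/district/1">District 1</a>
  <a href="/district/2">District 2</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)                          # Election results
print(soup.find("p", class_="intro").get_text())  # Turnout was high.
print([a["href"] for a in soup.find_all("a")])    # ['/district/1', '/district/2']
```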
What is Scrapy?
Scrapy is a framework, which means a lot of issues related to scraping have already been solved for you,
and you may use "automagical" tools to deal with them (with all the blessings and curses that brings).
Requires minimal knowledge and skills in Python and programming.
Handles much of the complexity of finding & evaluating links on a website, and of scraping whole domains or lists of domains.
Scrapy – installation
https://scrapy.org/
http://doc.scrapy.org/en/latest/intro/install.html
http://pypi.python.org/pypi/setuptools
Think about setting up Python virtual environments (recommended).
Unix-like (Mac and GNU/Linux):
sudo easy_install pip
pip install Scrapy
Other environments:
Read the manual; the recommended route is Anaconda.
Scrapy – dependencies
If you encounter problems during the installation, you should probably check the dependencies, e.g.:
lxml – an efficient XML and HTML parser
w3lib – a multi-purpose helper for dealing with URLs and web page encodings
cryptography and pyOpenSSL – to deal with various network-level security needs
Scrapy – typical way of working
You work with projects in a new directory.
Each Scrapy object stands for a single page on the website.
Each scraper has a few core functions and structures.
Crawlers are run from the command line.
Troubleshooting usually happens in configuration files (settings.py).
Scrapy – responsible scraping
Settings in settings.py:
ROBOTSTXT_OBEY = True
USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'
DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN
With ROBOTSTXT_OBEY enabled, disallowed pages are skipped:
2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by
robots.txt: <GET http://website.com/login>
Homework
Install the libraries and frameworks on your hardware.
Read the "Introduction" and "Creating a project" sections of the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
- Motivation and definitions
- Controversies
  - Legal issues and famous cases
  - Preventing bots
- Good practices
- Introduction to software
  - Requests library
  - Beautiful Soup
  - Scrapy
- Homework
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Example areas of usage
Download results of general elections
Download demo MP3
Download legal decisions
Download comments and usersrsquo activity from social media
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Alternatives to traditional scraping
ScraperWiki
Google Refine
Outwit
SocSciBot
(arguably) APIs already scraped collections of data availableonline (eg GHTorrent) or API mirrors
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Is it legal
Technically usually it is However there are situations inwhich you are not allowed to gather the data from thegiven website
Look for disclaimers and Terms of (Fair) Use
Scraping may mean copyright infringement
Data privacy sensitive information
web crawling ndash difficult matter
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Controversial cases
eBay vs Bidderrsquos Edge
One company crawls information from another
Facebook vs Meltwater
Prior written permissions needed
US vs Aaron Swartz
AS arrested in 2011 for having illegally downloaded millions ofarticles from the article archive JSTOR The case wasdismissed after Swartzrsquo suicide in January 2013
GHTorrent issue32 incident
httpspeakerdeckcomplayer1c64fd1e7dfe4032aff246b2dd1195bf
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Web scraping for social good or evil
People tend to be extremely paranoic about their data privacywhile being at the same time extremely careless about theirdata privacy
Same people who cover their cameras in laptops do notpotentially care about the photos they upload toFacebookother social media or the apps they are using
Celebgate httpenwikipediaorgwikiICloud_leaks_of_
celebrity_photos
AI gone too far httpwwwtelegraphcouktechnology20170908
ai-can-tell-people-gay-straight-one-photo-face
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Counteracting scraping ndash robotstxt
Many web crawlers are hunting for content
Web robots keep some indices actual
robotstxt file and other methods are used for keep theserver traffic in check
User-agent
Disallow private
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
robotstxt ndash limitations
Considered outdated standard due to its limitations
The subsite can be still indexed if another site redirects to it
Even if the subsite is not listed in robotstxt it does notmean the owner allows to scrape it
Imagine a situation between the host and the guest Ok feelat home but do not take the money I put in the jar on thethird shelf in the kitchen cupboard The guest should obeybut will they The same is with the robotstxt whichdiscloses the information that should stay hidden eg thelinks to adminrsquos login sites
Read httpsupportgooglecomwebmastersanswer6062608
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Counteracting scraping ndash ltmetagt tags
A modern approach to blocking bots from accessing the partof the website is using of ltmetagt tags
ltmeta name=robots content=noindexgt
Read httpssupportgooglecomwebmastersanswer93710
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
What makes a crawler polite
A polite crawler respects robotstxt
A polite crawler never degrades a websitersquos performance
A polite crawler identifies its creator with contact information
A polite crawler does not drive system administrators crazy
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 1 ndash Terms of Use
Make sure your crawler follows the rules contained in Terms of(Fair) Use
Do check robotstxt file
(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 2 ndash Identify yourself
Include your company name and an email address or websitein the requestrsquos User-Agent header
For example Googlersquos crawler user agent is ldquoGooglebotrdquo
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 3 ndash Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive
If you do not they will make you do
2012-12-21 050436+0800 [working] DEBUG Retrying
httpwwwexamplecomprofilephpid=1580gt (failed 1
times) 503 Service Unavailable
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data yougathered
Consider which part of the analyses andor data sets can bedisplayed publicly
Respect the sensitivity of the information
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Alternatives to traditional scraping
ScraperWiki
Google Refine
Outwit
SocSciBot
(arguably) APIs already scraped collections of data availableonline (eg GHTorrent) or API mirrors
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Is it legal
Technically usually it is However there are situations inwhich you are not allowed to gather the data from thegiven website
Look for disclaimers and Terms of (Fair) Use
Scraping may mean copyright infringement
Data privacy sensitive information
web crawling ndash difficult matter
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Controversial cases
eBay vs Bidderrsquos Edge
One company crawls information from another
Facebook vs Meltwater
Prior written permissions needed
US vs Aaron Swartz
AS arrested in 2011 for having illegally downloaded millions ofarticles from the article archive JSTOR The case wasdismissed after Swartzrsquo suicide in January 2013
GHTorrent issue32 incident
httpspeakerdeckcomplayer1c64fd1e7dfe4032aff246b2dd1195bf
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Web scraping for social good or evil
People tend to be extremely paranoic about their data privacywhile being at the same time extremely careless about theirdata privacy
Same people who cover their cameras in laptops do notpotentially care about the photos they upload toFacebookother social media or the apps they are using
Celebgate httpenwikipediaorgwikiICloud_leaks_of_
celebrity_photos
AI gone too far httpwwwtelegraphcouktechnology20170908
ai-can-tell-people-gay-straight-one-photo-face
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Counteracting scraping ndash robotstxt
Many web crawlers are hunting for content
Web robots keep some indices actual
robotstxt file and other methods are used for keep theserver traffic in check
User-agent
Disallow private
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
robotstxt ndash limitations
Considered outdated standard due to its limitations
The subsite can be still indexed if another site redirects to it
Even if the subsite is not listed in robotstxt it does notmean the owner allows to scrape it
Imagine a situation between the host and the guest Ok feelat home but do not take the money I put in the jar on thethird shelf in the kitchen cupboard The guest should obeybut will they The same is with the robotstxt whichdiscloses the information that should stay hidden eg thelinks to adminrsquos login sites
Read httpsupportgooglecomwebmastersanswer6062608
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Counteracting scraping ndash ltmetagt tags
A modern approach to blocking bots from accessing the partof the website is using of ltmetagt tags
ltmeta name=robots content=noindexgt
Read httpssupportgooglecomwebmastersanswer93710
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
What makes a crawler polite
A polite crawler respects robotstxt
A polite crawler never degrades a websitersquos performance
A polite crawler identifies its creator with contact information
A polite crawler does not drive system administrators crazy
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 1 ndash Terms of Use
Make sure your crawler follows the rules contained in Terms of(Fair) Use
Do check robotstxt file
(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 2 ndash Identify yourself
Include your company name and an email address or websitein the requestrsquos User-Agent header
For example Googlersquos crawler user agent is ldquoGooglebotrdquo
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 3 ndash Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive
If you do not they will make you do
2012-12-21 050436+0800 [working] DEBUG Retrying
httpwwwexamplecomprofilephpid=1580gt (failed 1
times) 503 Service Unavailable
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data yougathered
Consider which part of the analyses andor data sets can bedisplayed publicly
Respect the sensitivity of the information
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Scrapy – dependencies
If you encounter problems during the installation, you should probably check the dependencies, e.g.:
lxml – an efficient XML and HTML parser
w3lib – a multi-purpose helper for dealing with URLs and web page encodings
cryptography and pyOpenSSL – to deal with various network-level security needs
Scrapy – typical way of working
You work with projects, each in its own directory.
Each Scrapy object stands for a single page on the website.
Each scraper has a few core functions and structures.
Crawlers are run from the command line.
Troubleshooting is usually done within configuration files (settings.py).
Scrapy – responsible scraping
ROBOTSTXT_OBEY = True
USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'
DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN
2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
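In a Scrapy project these settings live in settings.py. A sketch — the identifiers are real Scrapy settings, but the concrete values here are illustrative:

```python
# settings.py -- polite defaults (values are illustrative).
ROBOTSTXT_OBEY = True        # refuse to fetch pages forbidden by robots.txt
USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'
DOWNLOAD_DELAY = 2.0         # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # no parallel hammering of one domain
```

With ROBOTSTXT_OBEY enabled, requests to disallowed pages are dropped with a "Forbidden by robots.txt" debug message like the one above.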
Homework
Install the libraries and frameworks on your hardware.
Read the Introduction and Creating a project sections of the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
- Motivation and definitions
- Controversies
  - Legal issues and famous cases
  - Preventing bots
- Good practices
- Introduction to software
  - Request library
  - Beautiful Soup
  - Scrapy
- Homework
Is it legal?
Technically, usually it is. However, there are situations in which you are not allowed to gather the data from a given website.
Look for disclaimers and Terms of (Fair) Use.
Scraping may mean copyright infringement.
Data privacy: sensitive information.
Web crawling – a difficult matter.
Controversial cases
eBay vs. Bidder's Edge
One company crawls information from another.
Facebook vs. Meltwater
Prior written permission needed.
US vs. Aaron Swartz
Swartz was arrested in 2011 for having illegally downloaded millions of articles from the article archive JSTOR. The case was dismissed after Swartz's suicide in January 2013.
GHTorrent issue #32 incident
http://speakerdeck.com/player/1c64fd1e7dfe4032aff246b2dd1195bf
Web scraping for social good or evil
People tend to be extremely paranoid about their data privacy while, at the same time, being extremely careless about it.
The same people who cover the cameras in their laptops often do not care about the photos they upload to Facebook and other social media, or about the apps they are using.
Celebgate: http://en.wikipedia.org/wiki/ICloud_leaks_of_celebrity_photos
AI gone too far? http://www.telegraph.co.uk/technology/2017/09/08/ai-can-tell-people-gay-straight-one-photo-face/
Counteracting scraping – robots.txt
Many web crawlers are hunting for content.
Web robots keep some indices up to date.
The robots.txt file and other methods are used to keep the server traffic in check:
User-agent: *
Disallow: /private/
robots.txt – limitations
Considered an outdated standard due to its limitations.
A subsite can still be indexed if another site redirects to it.
Even if a subsite is not listed in robots.txt, it does not mean the owner allows you to scrape it.
Imagine the situation between a host and a guest: "OK, feel at home, but do not take the money I put in the jar on the third shelf in the kitchen cupboard." The guest should obey, but will they? The same goes for robots.txt, which discloses information that should stay hidden, e.g. the links to the admin's login sites.
Read http://support.google.com/webmasters/answer/6062608
Counteracting scraping – <meta> tags
A modern approach to blocking bots from accessing part of a website is the use of <meta> tags:
<meta name="robots" content="noindex">
Read https://support.google.com/webmasters/answer/93710
What makes a crawler polite?
A polite crawler respects robots.txt.
A polite crawler never degrades a website's performance.
A polite crawler identifies its creator with contact information.
A polite crawler does not drive system administrators crazy.
Responsible crawling 1 – Terms of Use
Make sure your crawler follows the rules contained in the Terms of (Fair) Use.
Do check the robots.txt file.
(We know you have no time.) You may write a program which parses robots.txt and identifies particular user-agents and the corresponding access rules.
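Such a parser does not have to be written from scratch: Python's standard library ships one in urllib.robotparser. A sketch, with an invented robots.txt parsed from a string:

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt, parsed from a string rather than fetched.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask, per user-agent, whether a given URL may be fetched.
print(parser.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(parser.can_fetch("MyCrawler", "http://example.com/private/x"))   # False
print(parser.can_fetch("BadBot", "http://example.com/index.html"))     # False
```

For a live site, `parser.set_url("http://example.com/robots.txt")` followed by `parser.read()` fetches and parses the file instead.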
Responsible crawling 2 – Identify yourself
Include your company name and an email address or website in the request's User-Agent header.
For example, Google's crawler user agent is "Googlebot".
Responsible crawling 3 – Crawl delay
Make sure your crawler does not hit a website too hard.
Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-delay directive.
If you do not, the site will make you:
2012-12-21 05:04:36+0800 [working] DEBUG: Retrying <GET http://www.example.com/profile.php?id=1580> (failed 1 times): 503 Service Unavailable
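The standard library's robots.txt parser also exposes the Crawl-delay directive (Python 3.6+). A sketch with an invented robots.txt:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("""\
User-agent: *
Crawl-delay: 2
Disallow: /private/
""".splitlines())

# Delay (in seconds) the site asks crawlers to keep between requests;
# None if the directive is absent.
delay = parser.crawl_delay("MyCrawler")
print(delay)  # 2

# In a crawl loop you would then wait between requests:
# for url in urls:
#     page = fetch(url)            # hypothetical fetch function
#     time.sleep(delay or 1.0)     # fall back to a modest default
```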
Responsible crawling 4 – Take care of your data
Pay attention to what happens with the data you gathered.
Consider which parts of the analyses and/or data sets can be displayed publicly.
Respect the sensitivity of the information.
Good practices & ethics in a nutshell
Respect the hosting site's wishes:
robots.txt as a list of instructions to bots/scrapers; Terms of (Fair) Use – which elements of the website you should not scrape.
Respect the hosting site's bandwidth:
Scraping may be costly or result in the site going down.
Respect the law:
Terms & agreements.
Take care of the data:
Consider which information can be displayed publicly or not.
If possible, use APIs, API mirrors, or collections of data already scraped for you.
Requests library
Python core libraries may be enough, but there are more appropriate packages.
The Requests library is suitable for complicated HTTP requests: cookies, headers, and other issues.
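For comparison, the core-library way of attaching headers and a cookie by hand — a sketch with an illustrative URL and header values (Requests wraps the same job in a friendlier API):

```python
import urllib.request

# Build (but do not send) a request with custom headers and a cookie.
req = urllib.request.Request(
    "http://example.com/data",
    headers={
        "User-Agent": "MyCompany-MyCrawler (bot@mycompany.com)",
        "Cookie": "session=abc123",
    },
)
print(req.get_header("User-agent"))  # MyCompany-MyCrawler (bot@mycompany.com)

# urllib.request.urlopen(req) would send it; with Requests the same is
# roughly requests.get(url, headers=..., cookies=...).
```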
Beautiful what?
Beautiful Soup is a Python library.
Named after Lewis Carroll's poem from Alice's Adventures in Wonderland.
The tool is for making sense of the nonsensical.
Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Counteracting scraping ndash robotstxt
Many web crawlers are hunting for content
Web robots keep some indices actual
robotstxt file and other methods are used for keep theserver traffic in check
User-agent
Disallow private
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
robotstxt ndash limitations
Considered outdated standard due to its limitations
The subsite can be still indexed if another site redirects to it
Even if the subsite is not listed in robotstxt it does notmean the owner allows to scrape it
Imagine a situation between the host and the guest Ok feelat home but do not take the money I put in the jar on thethird shelf in the kitchen cupboard The guest should obeybut will they The same is with the robotstxt whichdiscloses the information that should stay hidden eg thelinks to adminrsquos login sites
Read httpsupportgooglecomwebmastersanswer6062608
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Counteracting scraping ndash ltmetagt tags
A modern approach to blocking bots from accessing the partof the website is using of ltmetagt tags
ltmeta name=robots content=noindexgt
Read httpssupportgooglecomwebmastersanswer93710
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
What makes a crawler polite
A polite crawler respects robotstxt
A polite crawler never degrades a websitersquos performance
A polite crawler identifies its creator with contact information
A polite crawler does not drive system administrators crazy
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 1 ndash Terms of Use
Make sure your crawler follows the rules contained in Terms of(Fair) Use
Do check robotstxt file
(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 2 ndash Identify yourself
Include your company name and an email address or websitein the requestrsquos User-Agent header
For example Googlersquos crawler user agent is ldquoGooglebotrdquo
Responsible crawling 3 – Crawl delay
Make sure your crawler does not hit a website too hard.
Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-delay directive.
If you do not, the site will make you:
2012-12-21 05:04:36+0800 [working] DEBUG: Retrying
<http://www.example.com/profile.php?id=1580> (failed 1 times): 503 Service Unavailable
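The directive can be read with the same standard-library parser as before. A sketch with a hypothetical robots.txt and a hypothetical `fetch` callable:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt declaring a 5-second crawl delay for everyone.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 5"])

delay = rp.crawl_delay("MyCrawler") or 1.0  # fall back to a modest default
print(delay)  # 5

def polite_fetch(urls, fetch):
    """Call fetch(url) for each URL, sleeping `delay` seconds in between."""
    for url in urls:
        fetch(url)
        time.sleep(delay)
```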
Responsible crawling 4 – Take care of your data
You should pay attention to what happens with the data you gathered.
Consider which part of the analyses and/or data sets can be displayed publicly.
Respect the sensitivity of the information.
Good practices & ethics in a nutshell
Respect the hosting site's wishes:
  robots.txt as a list of instructions to bots/scrapers;
  Terms of (Fair) Use – which elements of the website you should not scrape.
Respect the hosting site's bandwidth:
  scraping may be costly or result in the site going down.
Respect the law:
  terms & agreements.
Take care of the data:
  consider which information can be displayed publicly and which cannot.
If possible, use APIs, API mirrors, or collections of data already scraped for you.
Requests library / Beautiful Soup / Scrapy
Requests library
Python core libraries may be enough, but there are more appropriate packages.
The Requests library is suitable for complicated HTTP requests: cookies, headers, and other issues.
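A sketch of what that looks like (the endpoint and cookie values are made up). The request is prepared but not sent, so no network access happens here:

```python
import requests

# Build a GET request with custom headers and cookies; prepare() assembles
# the final wire-level headers without sending anything.
req = requests.Request(
    "GET",
    "https://example.com/data",           # hypothetical endpoint
    headers={"User-Agent": "MyCrawler/1.0", "Accept-Language": "en"},
    cookies={"session_id": "abc123"},     # hypothetical cookie
)
prepared = req.prepare()
print(prepared.headers["Cookie"])  # session_id=abc123

# To actually send it (network access required):
# with requests.Session() as s:
#     response = s.send(prepared, timeout=10)
#     response.raise_for_status()
```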
Beautiful what?
Beautiful Soup is a Python library.
Its name comes from Lewis Carroll's poem in Alice's Adventures in Wonderland.
The tool is for making sense of the nonsensical:
Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Installing Beautiful Soup
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
http://pypi.python.org/pypi/setuptools
Think about setting up Python virtual environments.
Unix-like (Mac and GNU/Linux):
  sudo easy_install pip
  pip install beautifulsoup4
Other environments:
  python setup.py install
  pip install beautifulsoup4
Problems after installation
The authors of the library suggest that most known problems are related to Python 2 vs Python 3 issues.
Typical way of working
You work with a file containing Python code and import the necessary libraries.
It requires some skills in programming.
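A minimal example of that workflow, parsing a small hypothetical page rather than a downloaded one (in a real script the HTML would come from e.g. `requests.get(url).text`):

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded document.
html = """
<html><body>
  <h1>Results</h1>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Turn the unstructured markup into structured data: (text, href) pairs.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
print(links)  # [('First', '/a'), ('Second', '/b')]
```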
What is Scrapy?
Scrapy is a framework, which means a lot of issues related to scraping have already been solved for you,
and you may use "automagical" tools to deal with them (with all the blessings and curses this provides).
It requires minimal knowledge and skills in Python and programming.
It handles a lot of the complexity of finding and evaluating links on a website, and of scraping whole domains or lists of domains.
Scrapy – installation
http://scrapy.org/
http://doc.scrapy.org/en/latest/intro/install.html
http://pypi.python.org/pypi/setuptools
Think about setting up Python virtual environments (recommended).
Unix-like (Mac and GNU/Linux):
  sudo easy_install pip
  pip install Scrapy
Other environments:
  read the manual; use of Anaconda is recommended.
Scrapy – dependencies
If you encounter problems during the installation, you should probably check for the dependencies, e.g.:
  lxml – an efficient XML and HTML parser;
  w3lib – a multi-purpose helper for dealing with URLs and web page encodings;
  cryptography and pyOpenSSL – to deal with various network-level security needs.
Scrapy – typical way of working
You work with projects in a new directory.
Each Scrapy object stands for a single page on the website.
Each scraper has a few core functions and structures.
Crawlers are run from the command line.
Troubleshooting is usually done within configuration files (settings.py).
Scrapy – responsible scraping
ROBOTSTXT_OBEY = True
USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'
DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN
2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by robots.txt: <GET http://website.com/login>
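The settings named above can be spelled out as a settings.py fragment; the concrete delay and concurrency values here are illustrative, not prescriptive, and should be tuned to the target site:

```python
# settings.py fragment – illustrative values
ROBOTSTXT_OBEY = True                 # honour robots.txt (requests to disallowed
                                      # pages are filtered out, as in the log above)
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"
DOWNLOAD_DELAY = 2                    # seconds between two requests to one site
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # no parallel hammering of a single host
```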
Homework
Install the libraries and frameworks on your hardware.
Read the Introduction and Creating a project sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
Outline:
- Motivation and definitions
- Controversies
  - Legal issues and famous cases
  - Preventing bots
- Good practices
- Introduction to software
  - Requests library
  - Beautiful Soup
  - Scrapy
- Homework
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
robotstxt ndash limitations
Considered outdated standard due to its limitations
The subsite can be still indexed if another site redirects to it
Even if the subsite is not listed in robotstxt it does notmean the owner allows to scrape it
Imagine a situation between the host and the guest Ok feelat home but do not take the money I put in the jar on thethird shelf in the kitchen cupboard The guest should obeybut will they The same is with the robotstxt whichdiscloses the information that should stay hidden eg thelinks to adminrsquos login sites
Read httpsupportgooglecomwebmastersanswer6062608
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Counteracting scraping ndash ltmetagt tags
A modern approach to blocking bots from accessing the partof the website is using of ltmetagt tags
ltmeta name=robots content=noindexgt
Read httpssupportgooglecomwebmastersanswer93710
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
What makes a crawler polite
A polite crawler respects robotstxt
A polite crawler never degrades a websitersquos performance
A polite crawler identifies its creator with contact information
A polite crawler does not drive system administrators crazy
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 1 ndash Terms of Use
Make sure your crawler follows the rules contained in Terms of(Fair) Use
Do check robotstxt file
(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 2 ndash Identify yourself
Include your company name and an email address or websitein the requestrsquos User-Agent header
For example Googlersquos crawler user agent is ldquoGooglebotrdquo
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 3 ndash Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive
If you do not they will make you do
2012-12-21 050436+0800 [working] DEBUG Retrying
httpwwwexamplecomprofilephpid=1580gt (failed 1
times) 503 Service Unavailable
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data yougathered
Consider which part of the analyses andor data sets can bedisplayed publicly
Respect the sensitivity of the information
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Legal issues and famous casesPreventing bots
Counteracting scraping ndash ltmetagt tags
A modern approach to blocking bots from accessing the partof the website is using of ltmetagt tags
ltmeta name=robots content=noindexgt
Read httpssupportgooglecomwebmastersanswer93710
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
What makes a crawler polite
A polite crawler respects robotstxt
A polite crawler never degrades a websitersquos performance
A polite crawler identifies its creator with contact information
A polite crawler does not drive system administrators crazy
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 1 ndash Terms of Use
Make sure your crawler follows the rules contained in Terms of(Fair) Use
Do check robotstxt file
(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 2 ndash Identify yourself
Include your company name and an email address or websitein the requestrsquos User-Agent header
For example Googlersquos crawler user agent is ldquoGooglebotrdquo
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 3 ndash Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive
If you do not they will make you do
2012-12-21 050436+0800 [working] DEBUG Retrying
httpwwwexamplecomprofilephpid=1580gt (failed 1
times) 503 Service Unavailable
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data yougathered
Consider which part of the analyses andor data sets can bedisplayed publicly
Respect the sensitivity of the information
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
What makes a crawler polite
A polite crawler respects robotstxt
A polite crawler never degrades a websitersquos performance
A polite crawler identifies its creator with contact information
A polite crawler does not drive system administrators crazy
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 1 ndash Terms of Use
Make sure your crawler follows the rules contained in Terms of(Fair) Use
Do check robotstxt file
(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 2 ndash Identify yourself
Include your company name and an email address or websitein the requestrsquos User-Agent header
For example Googlersquos crawler user agent is ldquoGooglebotrdquo
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 3 ndash Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive
If you do not they will make you do
2012-12-21 050436+0800 [working] DEBUG Retrying
httpwwwexamplecomprofilephpid=1580gt (failed 1
times) 503 Service Unavailable
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data yougathered
Consider which part of the analyses andor data sets can bedisplayed publicly
Respect the sensitivity of the information
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 1 ndash Terms of Use
Make sure your crawler follows the rules contained in Terms of(Fair) Use
Do check robotstxt file
(we know you have no time) you may write a program whichparses robotstxt and enables to identify particularuser-agents or corresponding rules of access
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 2 ndash Identify yourself
Include your company name and an email address or websitein the requestrsquos User-Agent header
For example Googlersquos crawler user agent is ldquoGooglebotrdquo
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 3 ndash Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive
If you do not they will make you do
2012-12-21 050436+0800 [working] DEBUG Retrying
httpwwwexamplecomprofilephpid=1580gt (failed 1
times) 503 Service Unavailable
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data yougathered
Consider which part of the analyses andor data sets can bedisplayed publicly
Respect the sensitivity of the information
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy – dependencies
If you encounter problems during the installation, you should probably check the dependencies, e.g.:
lxml – an efficient XML and HTML parser
w3lib – a multi-purpose helper for dealing with URLs and web page encodings
cryptography and pyOpenSSL – to deal with various network-level security needs
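A quick way to see which of these dependencies import cleanly from the current interpreter (note that pyOpenSSL imports under the module name OpenSSL):

```python
# Check whether Scrapy's key dependencies are importable
available = {}
for module in ("lxml", "w3lib", "cryptography", "OpenSSL"):
    try:
        __import__(module)
        available[module] = True
    except ImportError:
        available[module] = False

for module, ok in available.items():
    print(f"{module}: {'OK' if ok else 'missing'}")
```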
Scrapy – typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scraper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually happens within configuration files (settings.py)
Scrapy – responsible scraping
ROBOTSTXT_OBEY = True
USER_AGENT = 'MyCompany-MyCrawler (bot@mycompany.com)'
DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN
2016-08-19 16:12:56 [scrapy] DEBUG: Forbidden by
robots.txt: <GET http://website.com/login>
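Putting these settings together, a sketch of the relevant excerpt of a project's settings.py (the numeric values are illustrative, not recommendations):

```python
# Excerpt from a Scrapy project's settings.py (values are illustrative)
ROBOTSTXT_OBEY = True   # skip URLs that robots.txt forbids
USER_AGENT = "MyCompany-MyCrawler (bot@mycompany.com)"  # identify yourself
DOWNLOAD_DELAY = 2.0    # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap parallel requests per domain
```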
Homework
Install the libraries and frameworks on your machine
Read the Introduction and Creating a project sections in the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
- Motivation and definitions
- Controversies
  - Legal issues and famous cases
  - Preventing bots
- Good practices
- Introduction to software
  - Request library
  - Beautiful Soup
  - Scrapy
- Homework
Responsible crawling 2 – Identify yourself
Include your company name and an email address or website in the request's User-Agent header
For example, Google's crawler user agent is "Googlebot"
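With the Requests library, the header can be set once on a session (the User-Agent string below is a made-up example):

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "MyCompany-MyCrawler (bot@mycompany.com)"  # hypothetical identity
})
# Every request made through this session now carries the header:
# response = session.get("http://example.com/")
```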
Responsible crawling 3 – Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive
If you do not, the site will make you:
2012-12-21 05:04:36+0800 [working] DEBUG: Retrying
<GET http://www.example.com/profile.php?id=1580> (failed 1
times): 503 Service Unavailable
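The directive can be read with the standard library's urllib.robotparser; the robots.txt content below is invented for the example (normally you would fetch the real file from the site):

```python
import urllib.robotparser

# Invented robots.txt; in practice use rp.set_url(".../robots.txt"); rp.read()
robots_txt = """\
User-agent: *
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

delay = rp.crawl_delay("*") or 1  # fall back to a conservative default
print(delay)  # 5
# Between successive requests: time.sleep(delay)
```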
Responsible crawling 4 – Take care of your data
You should pay attention to what happens with the data you gathered
Consider which parts of the analyses and/or data sets can be displayed publicly
Respect the sensitivity of the information
Good practices & ethics in a nutshell
Respect the hosting site's wishes
robots.txt as a list of instructions to bots/scrapers
Terms of (Fair) Use – which elements of the website you should not scrape
Respect the hosting site's bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms & agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible, use APIs, API mirrors, or collections of data already scraped for you
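Whether robots.txt permits a given URL can be checked programmatically with the standard library before requesting it (the rules and URLs below are invented for illustration):

```python
import urllib.robotparser

# Invented robots.txt; in practice use rp.set_url(".../robots.txt"); rp.read()
robots_txt = """\
User-agent: *
Disallow: /login
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCompany-MyCrawler", "http://website.com/login"))   # False
print(rp.can_fetch("MyCompany-MyCrawler", "http://website.com/public"))  # True
```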
Requests library
Python core libraries may be enough, but there are more appropriate packages
The Requests library is suitable for complicated HTTP requests: cookies, headers, and other issues
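A sketch of what such "complicated" requests look like with the library; the URL, cookie, and header values are placeholders, and the actual network call is shown commented out:

```python
import requests

session = requests.Session()
session.headers.update({"Accept-Language": "en"})       # sent with every request
session.cookies.set("session_id", "placeholder-value")  # pre-set a cookie

# response = session.get(
#     "http://example.com/search",
#     params={"q": "web scraping"},  # becomes ?q=web+scraping
#     timeout=10,
# )
# response.raise_for_status()
# html = response.text
```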
Beautiful what?
Beautiful Soup is a Python library
Named after Lewis Carroll's poem from Alice's Adventures in Wonderland
The tool is for making sense of the nonsensical
Beautiful Soup, so rich and green,
Waiting in a hot tureen!
Who for such dainties would not stoop?
Soup of the evening, beautiful Soup!
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 3 ndash Crawl delay
Make sure your crawler does not hit a website too hard
Respect the delay that crawlers should wait between requestsby following the robotstxt Crawl-Delay directive
If you do not they will make you do
2012-12-21 050436+0800 [working] DEBUG Retrying
httpwwwexamplecomprofilephpid=1580gt (failed 1
times) 503 Service Unavailable
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data yougathered
Consider which part of the analyses andor data sets can bedisplayed publicly
Respect the sensitivity of the information
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Responsible crawling 4 ndash Take care of your data
You should pay attention to what happens with the data yougathered
Consider which part of the analyses andor data sets can bedisplayed publicly
Respect the sensitivity of the information
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Good practices amp ethics in a nutshell
Respect the hosting sitersquos wishes
robotstxt as a list of instruction to botsscrapersTerms of (Fair) Use ndash which elements of the website youshould not scrap
Respect the hosting sitersquos bandwidth
Scraping may be costly or result in the site going down
Respect the law
Terms amp agreements
Take care of the data
Consider which information can be displayed publicly or not
If possible use APIs APIs mirrors or collections of dataalready scraped for you
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Requests library
Python core libraries may be enough but there are moreappropriate packages
Requests library is suitable for complicated HTTP requestscookies headers and other issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Beautiful what
Beautiful Soup is Python library
Lewis Carrollrsquos poem from Alicersquos Adventures in Wonderland
The tool is for making sense of nonsensical
Beautiful Soup so rich and greenWaiting in a hot tureen
Who for such dainties would not stoopSoup of the evening beautiful Soup
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Problems after installation
Authors of the library suggest that usually known problems arerelated to Python 2 vs Python 3 issues
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Typical way of working
You work with file containing python code and import thenecessary libraries
It requires some skills in programming
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
What is Scrapy
Scrapy is a framework which means a lot of issues related toscraping have already been solved for you
and you may use ldquoautomagicalrdquo tools to deal with them(with all the blessings and curses it provides)
Requires minimal knowledge and skills in Python andprogramming
Handles a lot of complexity of finding amp evaluating links on awebsite scraping whole domains or lists of domains
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash installation
httpscrapyorg
httpdocscrapyorgenlatestintroinstallhtml
httppypipythonorgpypisetuptools
Think about setting Python virtual environments(recommended)
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install Scrapy
Other environments
Read the manual recommended use of anaconda
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash dependencies
If you encounter problems during the installation probablyyou should check for the dependencies eg
lxml ndash efficient XML and HTML parserw3lib ndash a multi-purpose helper for dealing with URLs and webpage encodingscryptography and pyOpenSSL ndash to deal with variousnetwork-level security needs
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash typical way of working
You work with projects in a new directory
Each Scrapy object stands for a single page on the website
Each scrapper has a few core functions and structures
Crawlers are run from the command line
Troubleshooting usually within configuration files(settingspy)
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Scrapy ndash responsible scraping
ROBOTSTXT OBEY = True
USER AGENT = rsquoMyCompany-MyCrawler
(botmycompanycom)rsquo
DOWNLOAD DELAY
CONCURRENT REQUESTS PER DOMAIN
2016-08-19 161256 [scrapy] DEBUG Forbidden by
robotstxt ltGET httpwebsitecomlogingt
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Homework
Install the libraries and frameworks on your hardware
Read the Introduction and Creating a project sections inScrapy tutorialhttpdocscrapyorgenlatestintrotutorialhtml
- Motivation and definitions
- Controversies
-
- Legal issues and famous cases
- Preventing bots
-
- Good practices
- Introduction to software
-
- Request library
- Beautiful Soup
- Scrapy
-
- Homework
-
Motivation and definitionsControversies
Good practicesIntroduction to software
Homework
Request libraryBeautiful SoupScrapy
Installing Beautiful Soup
httpwwwcrummycomsoftwareBeautifulSoupbs4doc
httppypipythonorgpypisetuptools
Think about setting Python virtual environments
Unixlike (Mac and GNULinux)
sudo easy install pip
pip install beautifulsoup4
Other environments
python setuppy install
pip install beautifulsoup4
Problems after installation
The authors of the library suggest that the usual known problems are related to Python 2 vs Python 3 issues
Typical way of working
You work with a file containing Python code and import the necessary libraries
It requires some programming skills
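Beautiful Soup wraps this parse-and-extract loop in a much friendlier API; a stdlib-only sketch of the same pattern (the HTML string is illustrative) shows what such a file looks like:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<html><body><a href="/about">About</a><a href="/contact">Contact</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/about', '/contact']
```

In practice you would fetch `page` with the Requests library and hand it to Beautiful Soup instead of subclassing HTMLParser by hand.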
What is Scrapy?
Scrapy is a framework, which means a lot of issues related to scraping have already been solved for you
and you may use "automagical" tools to deal with them (with all the blessings and curses that provides)
Requires minimal knowledge and skills in Python and programming
Handles a lot of the complexity of finding & evaluating links on a website, and of scraping whole domains or lists of domains
Scrapy – installation
http://scrapy.org/
http://docs.scrapy.org/en/latest/intro/install.html
http://pypi.python.org/pypi/setuptools
Think about setting up Python virtual environments (recommended)
Unix-like (Mac and GNU/Linux):
sudo easy_install pip
pip install Scrapy
Other environments:
Read the manual; the recommended route is Anaconda