art of the scrape!!!! show the internet who’s boss. scrape it!
TRANSCRIPT
![Page 1: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/1.jpg)
Art of the scrape!!!!
Show the internet who’s boss.
Scrape it!
![Page 2: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/2.jpg)
Before we begin!
You might want….
A computer!Server space!Processing!
![Page 3: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/3.jpg)
What’s going on here?
Today we’re going to….• Examine our data resources!• Try some scraping!• Try some pulling!• Mess around with an API!• Say hello to visualization!
![Page 4: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/4.jpg)
Data? I hardly knew-a!
• Data: Any discreet unit and its meta information
• Useful data: More than one record of data...but that second record can be in your head!
• Everything is numbers!
![Page 5: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/5.jpg)
Internal Use
![Page 6: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/6.jpg)
External Use
![Page 7: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/7.jpg)
Tell me more of this data of which you speak!
Real-time• Blogs• Twitter feed• News feeds…• Etc
Static data sets• Gov’t census data,• EPA data• National League salaries• etc
![Page 8: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/8.jpg)
Data is Powerful!
The act of measuring something solidifies its state.
Ahh, the power!!!
![Page 9: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/9.jpg)
Data is misleading!
• Choosing one source over another
• Only portraying parts of the statistic
• Choosing a biased method of portrayal
![Page 10: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/10.jpg)
![Page 11: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/11.jpg)
Information Overload: don’t believe the hype
![Page 12: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/12.jpg)
Flavors of data
• Indexed data - documents, weblogs, images, videos, shopping articles, jobs ...
• Cartographic and geographic data -Geolocation software, Geovisualization
• News Aggregators - Feeds, podcasts:
![Page 13: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/13.jpg)
DATATYPE!
• Straight text
• CSV/ tab delimited
• XML/RSS/ATOM
• JSON
Would it fit in here? Then its data!
![Page 14: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/14.jpg)
VIA
• Text file• Data feed• Scraping html• API• Some combination
Could it potentially be transferred by this? Then it’s grabable!
![Page 15: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/15.jpg)
DESTINATION
• Spreadsheet (by hand)
• Browser (direct, javascript, php, perl...)
• Database (via sql using php, perl, etc....)
• Application (Processing, java, python)
• A second API
![Page 16: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/16.jpg)
Mom, where does data come from?HTML for scraping: Anywhere you can see text online• Weather.com• Yahoo trending topics
Preformatted data sets: Anywhere it’s available• Amazon data sets• opendata.gov
Realtime rss feeds: Anywhere there’s a data feed• Any blog feed• Any news feed
Personalized Awesome targeted data: Anywhere with an API. • New York times API• Twitter API
![Page 17: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/17.jpg)
Choose wisely!
DATATYPE VIA DESTINATION• xml/rss Browser Excel• csv text file php: database• xml api php:browser• xml api javascript:browser• html scraping php browser• csv text file Processing• html scraping Processing• xml browser Processing
(through php)
![Page 18: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/18.jpg)
Less talk! More scrape!
Download this! http://www.binaryspark.com/Scrape/examples.zip
As we work, upload here for testing (if you don’t already have space)
ftp.binaryspark.comLogin: [email protected]: zoerocks10Files will appear at:http://www.binaryspark.com/Scrape/classwork/YOURFILESNAME.php
![Page 19: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/19.jpg)
Example 1 and 2
Datatype VIA DestinationHTML SCRAPING BROWSER(Weather info) (PHP) (Firefox, or whatever)
• Step one: Get to know your data: http://www.weather.com/weather/today/New+York+NY+10010?lswe=10010
• Step two: Set up the code
![Page 20: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/20.jpg)
![Page 21: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/21.jpg)
Example 1: Straight scrapin’
<?php
$url = 'http://www.weather.com/weather/today/New+York+NY+10010?lswe=10010';
$output = file_get_contents($url);echo $output;
?>
Get the data!
Do Something with it!
![Page 22: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/22.jpg)
Example 1
<?php
$url = 'http://www.weather.com/weather/today/New+York+NY+10010?lswe=10010';
$output = file_get_contents($url);echo $output;
?>
![Page 23: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/23.jpg)
Example 2: Scraping with a purpose$currentTerm = NULL; //we'll use this to hold the words!$myUrl = "http://www.google.com/trends/hottrends/atom/hourly$searchForStart = "sa=X\">"; $searchForEnd = "</a>";
$rawPage = file_get_contents($myUrl);
echo "<B>These are this hour's trending topics on Google!</b><BR><BR>"; while ($startPos = (strpos($rawPage, $searchForStart))) { //as long as there's more stuff to find, find it!
$endPos = strpos($rawPage, $searchForEnd); //And then find where it ends! $length = $endPos - $startPos; //How long is this string we've found,
anyway? if ($startPos && $endPos) { //Did we find something? Then
$currentTerm = substr($rawPage, ($startPos+strlen($searchForStart)), $length-6); echo $currentTerm . "<BR>";
} //end if $rawPage = substr($rawPage, ($endPos + 4));} //end while
Get the data!
Do Something with it!
Get everything ready
![Page 24: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/24.jpg)
Example 2$currentTerm = NULL; //we'll use this to hold the words!$myUrl = "http://www.google.com/trends/hottrends/atom/hourly$searchForStart = "sa=X\">"; $searchForEnd = "</a>";
$rawPage = file_get_contents($myUrl);
echo "<B>These are this hour's trending topics on Google!</b><BR><BR>"; while ($startPos = (strpos($rawPage, $searchForStart))) { //as long as there's more stuff to find, find it!
$endPos = strpos($rawPage, $searchForEnd); //And then find where it ends! $length = $endPos - $startPos; //How long is this string we've found,
anyway? if ($startPos && $endPos) { //Did we find something? Then
$currentTerm = substr($rawPage, ($startPos+strlen($searchForStart)), $length-6); echo $currentTerm . "<BR>";
} //end if $rawPage = substr($rawPage, ($endPos + 4));} //end while
![Page 25: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/25.jpg)
Example 3
Datatype VIA DestinationXML RSS FEED BROWSER(Huffington post) (PHP) (Firefox, or whatever)
• Step one: Get to know your data: http://feeds.huffingtonpost.com/huffingtonpost/raw_feed
• Step two: Set up the code
![Page 26: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/26.jpg)
What’s this xml stuff?<introductory tags><entry>
<title></title><id></id> <published></published> <updated>2010-06-19T15:50:45Z</updated> <summary>summary> <author>
<name></name> <uri>http://www.huffingtonpost.com/anne-naylor/</uri>
</author> <content></ content>
</entry>
![Page 27: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/27.jpg)
Example 3: XML makes things awesome$url = "http://feeds.huffingtonpost.com/huffingtonpost/raw_feed";$data = file_get_contents($url);
$xml = new SimpleXmlElement($data);
echo "<b>Here are the current popular posts from Huffington Post without the ads!</b><BR><BR><ul>";
foreach ($xml->entry as $item) { //navigate to the tag we want?
$myTitle = "unknown"; //initialize the variable so it's all set!
$myTitle = trim($item->title); echo "<LI>" . $myTitle . "<br>"; //now print it!
}//end foreachecho "</ul>";
Get the data!
Do Something with it!
Get it in a form we can use
![Page 28: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/28.jpg)
Example 3: XML makes things awesome$url = "http://feeds.huffingtonpost.com/huffingtonpost/raw_feed";$data = file_get_contents($url);
$xml = new SimpleXmlElement($data);
echo "<b>Here are the current popular posts from Huffington Post without the ads!</b><BR><BR><ul>";
foreach ($xml->entry as $item) { //navigate to the tag we want?
$myTitle = "unknown"; //initialize the variable so it's all set!
$myTitle = trim($item->title); echo "<LI>" . $myTitle . "<br>"; //now print it!
}//end foreachecho "</ul>";
![Page 29: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/29.jpg)
Example 4
Datatype VIA DestinationXML data FEED API and
BROWSER(US Exchange rates) (PHP) (Google Charts API
Firefox, or whatever)
• Step one: Get to know your data: http://rss.timegenie.com/forex.xml
• Step two: Set up the code
![Page 30: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/30.jpg)
API’s? Eh?
• Data –all the types of data we discussed before
• Functionality Data converters: language translators, speech processing, url shorteners)
Communication: email, IM, notifications
Visual data rendering: Information visualization, diagrams, maps
Security related : electronic payment systems, ID identification...
![Page 31: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/31.jpg)
Example 4: Doing the two-stepGet the data!
Run it through a second Process
Get it in a form we can use
Do something with it (like displaying that baby!)
![Page 32: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/32.jpg)
Bringing data into a higher-level Application…like processing!
• Install the simplml library:http://www.learningprocessing.com/tutorials/simpleml/
• Inspect your data for structure• Write some code!– Declare your xml intent!– Make the request!– Process the request!– Do fun stuff with it!
![Page 33: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/33.jpg)
![Page 34: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/34.jpg)
![Page 35: Art of the scrape!!!! Show the internet who’s boss. Scrape it!](https://reader036.vdocuments.net/reader036/viewer/2022062421/56649c9b5503460f9495955e/html5/thumbnails/35.jpg)