DAI 523 Information Design 1: Data Visualization | Trogu | Fall 2015
Last updated: Friday, October 9, 2015

Data Gleaning Tutorial: NY Times D3 --> JSON --> CSV
Step by Step

All the files related to this tutorial can be found at:
http://unixlab.sfsu.edu/~trogu/523/2015/data_gleaning_tutorial/

In this exercise we’ll “glean” a data set directly from the HTML page source of an article and interactive visualization on oil prices and production (Fig. 1). The article’s URL is:
http://www.nytimes.com/interactive/2015/09/30/business/how-the-us-and-opec-drive-oil-prices.html

I use the term “gleaning” as opposed to “scraping” — the term used when scripts and other web robots or programs are used to gather data from html pages — because it reminds me of when I was a kid and did all kinds of gleaning: artichoke fields, asparagus, snails, etc.

The article is titled:
How the U.S. and OPEC Drive Oil Prices
By Jeremy Ashkenas, Alicia Parlapiano and Hannah Fairfield, October 5, 2015

Fig. 1 Various customized views of the data can be seen by toggling the buttons in the right column. In this view, I selected the “monthly” oil production and prices.

Like many large sites built on a content management system (CMS), the New York Times website is constantly updated by hundreds of different people. Its pages are therefore not individually composed; rather, many different pieces are edited and assembled on the fly at “run time”. Because of this complexity and the risk of broken links, the final page will often include in its page source the actual data sets used in the visualization (see Fig. 2 for how to view the page source of a web page). The data sets will not be a traditional spreadsheet; rather, each data point will be formatted in some standard way that can be read by the program running the visualization.

Fig. 2 To view the page source, go to Tools --> Web Developer --> Page Source (Browser: Firefox)

Although the page source is not pretty to look at, with some patience and a bit of luck you will sometimes find the data set hidden in just such chaos (Fig. 3), and finding it here will save you a lot of work searching for the same information elsewhere. Plus, chances are that the data is already logically formatted for the purposes of the visualization. Please don’t forget to credit your sources: both the primary sources, like the original data collectors, and the journalists and designers who later used that data in the articles that you found.

Fig. 3 Not pretty, but some digging and getting your hands dirty with this kind of stuff will sometimes offer unexpected rewards.

D3
D3 (Data-Driven Documents) has become an important tool favored by many data visualization designers and engineers, mainly because of its efficiency, its fast rendering, and its vector-based, resolution-independent output. Most importantly, because it is a JavaScript library, it runs natively in the browser and does not require special plug-ins such as Flash. Animations in D3 are very smooth and fast, and require little additional coding once the code has been set up.
http://d3js.org/

JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999.
http://www.json.org/

In the page source of the article, you will see that the data set is formatted in JSON. Although it’s still a bit confusing to look at, upon closer inspection you will see that there is a logic that’s not difficult to grasp. With a bit of patience, one could reconstruct the data set by hand with cut and paste. Luckily, the programming community is a great source of useful tools, and we’ll use one such tool to automatically convert JSON to CSV (comma-separated values) files.
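To make that logic concrete, here is a minimal sketch of reading a small JSON data set with Python’s standard json module. The key names (year, month, price, production) and all the numbers are invented placeholders for illustration; they are not taken from the article’s page source.

```python
import json

# Hypothetical snippet in the general shape of an embedded data set:
# a JSON array (square brackets) of objects (curly braces).
# Keys and values are invented placeholders, not figures from the article.
raw = (
    '[{"year": 2000, "month": 1, "price": 10.5, "production": 70.2},'
    ' {"year": 2000, "month": 2, "price": 11.0, "production": 69.8}]'
)

# json.loads turns the text into a list of dictionaries, one per data point.
records = json.loads(raw)

for rec in records:
    print(rec["year"], rec["month"], rec["price"], rec["production"])
```

Each pair of curly braces becomes one row, and each key becomes one column, which is the same mapping you would follow when rebuilding the table by hand.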

How oil prices move up and down
In the NYT scatterplot that displays the two variables of oil price (Y axis) and oil production (X axis), each point is a year (or a month). Because oil production in some years is lower than in previous years, the line appears to move backward (to the left). However, it is not the years that move backward, just the production (less oil was produced in that year) (Fig. 4). See this other graph (Fig. 5) by one of the authors, Hannah Fairfield, on traffic fatalities, which employs the same design: Driving Safety, in Fits and Starts:
http://www.nytimes.com/interactive/2012/09/17/science/driving-safety-in-fits-and-starts.html

Fig. 4 Oil Prices graph from Oct. 5, 2015
Fig. 5 Traffic Safety graph from Sept. 17, 2012
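The backward movement of the line can be checked with a few lines of Python. This is only a sketch: the points list holds invented placeholder values, and the function name horizontal_moves is made up for this example.

```python
# Hypothetical (year, production, price) points; the numbers are placeholders,
# not values from the NYT data set.
points = [
    (2001, 70.0, 20.0),
    (2002, 72.0, 25.0),
    (2003, 71.0, 30.0),  # production falls, so the line steps to the left
    (2004, 74.0, 28.0),
]

def horizontal_moves(pts):
    """Label each step between consecutive years by its direction on the X axis."""
    moves = []
    for (y0, x0, _), (y1, x1, _) in zip(pts, pts[1:]):
        if x1 > x0:
            direction = "right"
        elif x1 < x0:
            direction = "left"
        else:
            direction = "none"
        moves.append((y0, y1, direction))
    return moves

moves = horizontal_moves(points)
for start, end, direction in moves:
    print(f"{start} to {end}: {direction}")
```

Time always advances from one point to the next; only the X value (production) decides whether the segment is drawn to the right or to the left.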

Looking up the page source
All browsers can display the HTML source code of the page that you are viewing. The source code may be more or less complete depending on whether the page employs frames and whether each component is already included or has to be called up from elsewhere, such as a separate CSS file or a server.

For the reasons stated earlier, the NY Times almost always displays its full code and often the data as well, unless copyright or permission issues prevent it from doing so.
To look up the page source in Firefox, go to Tools --> Web Developer --> Page Source (Fig. 6).

Fig. 6 Look up the page source in developer Tools (Firefox).

Fig. 7 In the source code, a bit of color coding lets you separate the different types of information, but visually it is still quite hard to figure out, especially for basic users.

Copy all the source code and paste it into TextWrangler (Mac) or Notepad++ (PC). Note that this tutorial always refers to TextWrangler as the text editor, but Notepad++ will work just as well.

Save the code. Note that if you save the code as simple text (Fig. 8, extension .txt) you will not be able to tell easily which parts are tags, which are HTML, which are text, etc. If you save it as HTML (Fig. 9, extension .html) or as something like JavaScript (extension .js), even though technically not everything in it is a script, the editor will color the code, the distinction becomes much clearer, and you will be able to identify the different parts easily.

Fig. 8 Saving the file with extension .txt will not display different colors for the different parts of the source code.

Fig. 9 However, if you save the file with extension .html or .js, then the different parts will be more distinct. In this case, the data itself is either red or blue, as opposed to black.

In TextWrangler you can toggle soft wrapping (View --> Text Display --> Soft Wrap Text) to find things more easily (Fig. 10). Code lines are often super-long, and making sense of them is much easier if you can look at them in shorter bits rather than as infinitely long strings.

Fig. 10 Switch Text Display to more easily find stuff and to see if there is a break in the lines.
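As an alternative to scanning the saved file by eye, the search can be narrowed with a short script. The sketch below uses Python’s standard html.parser module to collect the contents of every script element and then flags those that look like they hold a list of records. It assumes the data sits inside a script element; the sample HTML, the variable name monthly, and the "[{" heuristic are illustrative assumptions, not details confirmed for the NYT page.

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collect the text found inside every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data

# Stand-in for the saved page source (hypothetical content).
# A real file would be read with open("page.html").read().
html = """<html><body><p>Article text</p>
<script>var monthly = [{"year": 2000, "month": 1}];</script>
<script>console.log("other code");</script>
</body></html>"""

collector = ScriptCollector()
collector.feed(html)

# Heuristic: a "[" directly followed by "{" suggests an array of records.
candidates = [s for s in collector.scripts if "[{" in s]
print(len(collector.scripts), "script blocks,", len(candidates), "candidate(s)")
```

The heuristic will miss data that is formatted differently, so treat the result as a list of places to look first rather than a guarantee.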

Data converters
I remembered that Shan Carter, another designer at the NYT, had put together a personal page that converts spreadsheets into various web formats like HTML, JSON and XML (Fig. 11):
https://shancarter.github.io/mr-data-converter/

Fig. 11 Use Mr Data Converter to turn Excel files into HTML, JSON, or XML. Use the sample file (upper right button) to see how the conversion works.

So, on this website, pasting or importing an Excel file converts it to the above formats. For our purpose we actually need to do the opposite: go from a web format to a spreadsheet or text format. Even though we can’t do that here, I used the sample file on the website to confirm that what I thought was JSON code was in fact JSON and not something else.

A quick scan of the example confirmed that it was JSON which was good. Most importantly though, the comparison shows me how the parts are transformed from the column format of the spreadsheet to the JSON notation or vice-versa. This will be useful later to make sure we did the conversion right.
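The same comparison can be reproduced locally. The following sketch uses Python’s standard csv and json modules to turn a small column-format table into JSON notation; the column names and values are invented placeholders.

```python
import csv
import io
import json

# A tiny table in column format (hypothetical columns and values).
table = "year,month,price\n2000,1,10.5\n2000,2,11.0\n"

# DictReader uses the header row as keys, giving one dictionary per row.
rows = list(csv.DictReader(io.StringIO(table)))

# Each row becomes one {...} object inside a [...] array.
as_json = json.dumps(rows, indent=2)
print(as_json)
```

Note that the csv module reads every cell as text, so the numbers appear as quoted strings in this JSON output; a dedicated converter may choose to write them as numbers instead.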

A quick search on the web (convert json to csv) found a very simple website that does the opposite conversion, taking a JSON file and turning it into other formats like CSV, which is what we want (Fig. 12):
http://www.convertcsv.com/json-to-csv.htm

Fig. 12 Convertcsv.com/json-to-csv.htm. This website will convert your JSON file to a CSV file.

So let’s make a file that is just JSON with the code from the page source.

I went back to the full source code and isolated the chunk that I thought was a complete data set. The clues were the separate lines at the beginning indicating “monthly” and “yearly”, and the opening and closing square brackets […] at the beginning and end of the set (Fig. 13).

Fig. 13 This JSON data set begins and ends with square brackets. Turn the line numbers ON to help you go back to parts of the text you browsed earlier. You turn them on in the “T” drop-down menu, which is the same Text Display menu that lets you choose “Soft Wrap Text”.

Copy that chunk of code (it begins and ends with a square bracket) and paste it into TextWrangler.
Save it as something that ends with .js, although saving it as .txt will also work. In my case I saved the file as crude_oil_monthly.js
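The copy-and-paste step can also be sketched in code. The example below finds a marker word, jumps to the next opening square bracket, and lets Python’s json.JSONDecoder.raw_decode read exactly one JSON value from that position, so the matching closing bracket is found automatically. The sample text, the marker “monthly”, and the helper name extract_array are assumptions for illustration, and the approach only works if the embedded data is strict JSON (double-quoted keys, no trailing commas).

```python
import json

# Stand-in for the page source (hypothetical variable names and values).
source = (
    'var x = 1; var monthly = [{"year": 2000, "month": 1, "price": 10.5}, '
    '{"year": 2000, "month": 2, "price": 11.0}]; var y = 2;'
)

def extract_array(text, marker):
    """Return the first JSON array that appears after `marker` in `text`."""
    start = text.index("[", text.index(marker))
    # raw_decode parses one JSON value starting at `start`
    # and ignores whatever follows it in the text.
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value

monthly = extract_array(source, "monthly")
print(len(monthly), "records")
```

If the marker or the bracket is missing, str.index raises ValueError, which is a useful signal that the chunk is not where you expected it.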

On the JSON/CSV website, import the file using the Browse button (Choose JSON file here) and locate the file on your computer (Fig. 14). Select it, then click Convert JSON to CSV (Fig. 15).

Fig. 14 Browse and import your JSON file.
Fig. 15 Convert your file from JSON to CSV.

Now, either save to disk or copy the text generated within the display window — save button is right above window (Fig. 16).

Fig. 16 After the conversion, either copy-paste the new text in the window, or save the CSV file to disk.
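For readers who prefer to do the conversion locally, here is a minimal sketch of the same JSON to CSV step with Python’s standard json and csv modules. It is not how the website works internally; it assumes a flat list of records that all share the same keys, and the records shown are invented placeholders.

```python
import csv
import io
import json

# Hypothetical records; keys and values are placeholders.
raw = '[{"year": 2000, "month": 1, "price": 10.5}, {"year": 2000, "month": 2, "price": 11.0}]'
records = json.loads(raw)

def records_to_csv(rows):
    """Write a list of flat dicts as CSV text, using the first row's keys as the header."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

csv_text = records_to_csv(records)
print(csv_text)
```

To save the result to disk instead of printing it, write csv_text to a file opened in text mode.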

The result looks already pretty good but to make sure, open the saved CSV file in Excel to see what the header and columns look like (Fig. 17):

Fig. 17 Opening the CSV file in Excel will let you see if the header, the columns, and the rows all display properly. However, it’s best not to rely on Excel for your primary data sets, so quit Excel without saving and go back to the better CSV file (text-only file).

This is great as you can immediately see how the complexity of the animated D3 scatterplot comes down to a clean and organized data set. Note how the year, month, and time columns are set up. This is crucial for properly displaying the time data.
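The header and columns can also be inspected without opening Excel. This sketch reads CSV text with Python’s csv.DictReader, lists the column names, and combines the year and month columns into a sortable label. The column names and values are hypothetical, and the real file may lay out its time columns differently.

```python
import csv
import io

# Stand-in for the converted file (hypothetical columns and values).
# A real file would be opened with open("crude_oil_monthly.csv", newline="").
csv_text = "year,month,price\n2000,1,10.5\n2000,2,11.0\n"

reader = csv.DictReader(io.StringIO(csv_text))
header = reader.fieldnames
rows = list(reader)

print("columns:", header)
print("rows:", len(rows))

# Combine year and month into labels such as 2000-01 that sort in time order.
labels = [f"{int(r['year']):04d}-{int(r['month']):02d}" for r in rows]
print(labels)
```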

Now forget Excel, don’t save, and you are ready to use the comma separated data file as needed (Fig. 18).

Fig. 18 Pasting your data into a plain text editor (no formatting allowed) ensures that nothing but basic Unicode characters are recorded. That way your data is clean and does not contain hidden characters (except perhaps carriage returns, spaces, etc.) that might trip up your code, in R for example. Examples of this type of “dirt” are formulas imported from Excel, or extra rows that contain no data but do contain invisible junk that could again trip up your script.
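A quick way to look for this kind of dirt is sketched below in Python. The function name find_dirt and the sample text are made up for this example, and the two checks (rows with no data, invisible control or format characters) are only a starting point, not a complete cleaning routine.

```python
import unicodedata

def find_dirt(text):
    """Return (line number, description) pairs for suspicious content."""
    problems = []
    # splitlines() removes line endings, so carriage returns are not reported.
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip(", \t") == "":
            problems.append((number, "row with no data"))
        for ch in line:
            # "Cc" = control characters, "Cf" = invisible format characters
            # such as the byte order mark; tabs are allowed.
            if unicodedata.category(ch) in ("Cc", "Cf") and ch != "\t":
                problems.append((number, f"hidden character U+{ord(ch):04X}"))
    return problems

# Hypothetical sample: a byte order mark on line 1 and an empty row on line 3.
sample = "\ufeffyear,month,price\r\n2000,1,10.5\r\n,,\r\n2000,2,11.0\n"
problems = find_dirt(sample)
for number, description in problems:
    print(f"line {number}: {description}")
```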

I have also saved the “yearly” data set as was found in the code. Since there are more than a dozen separate custom views of the data (scatterplot visualizations) in the NY Times article, I assume that different parts of these two data sets are used at different times in the code.

Now go visualize, have fun, and don’t forget to credit your sources!

Pino Trogu
San Francisco State University
October 9, 2015

**************************

I have uploaded all the files related to this tutorial to this “off-iLearn” directory, so that the files will still be available after the class ends:
http://unixlab.sfsu.edu/~trogu/523/2015/data_gleaning_tutorial/
