baseball parallel coordinates – step by step


PARALLEL COORDINATES

baseball - game percentage, top 15 teams in 2011 (from highest to lowest) and 2012

This exercise is based on the parallel coordinates plot described in Chapter 7: Spotting Differences (pp. 251-258) of the book Visualize This. However, it uses the plotrix package instead of the lattice package described in the chapter (both are R packages with similar uses).

The final complete code can be found here.

Gather data from one of many baseball sites. In this case, the ESPN site, for baseball standings as of 10/31/2011: http://www.espn.com/mlb/standings/_/season/2011

and 10/31/2012: http://www.espn.com/mlb/standings/_/season/2012

Simply select the tables, including the headers (American League, etc.), and paste them into Excel. You will have to clean up the data in Excel: remove the x and other markers in front of the team names, and remove all headers except the stats headers. Add a “team” label in the header of the team column.

In Excel, create a file with multiple worksheets so you can keep your data organized. For this exercise, you will need the team names, and the percentage number (PCT). Above 0.500 means a team has won more games than it has lost. Copy and paste the PCT columns from both 2011 and 2012 into a new worksheet.

In the new worksheet, I made the decimal 3 digits (format > cells > number > decimal places > 3) to keep the numbers looking the same.

Sort the “percent_2011” column from highest to lowest. Make sure to expand the selection before sorting, so that the team names and the 2012 values move together with it.

For the exercise, we’ll only use the top 15 teams, as shown above.

DAI 523 Information Design 1: Data Visualization | Trogu | Fall 2012 | Wed. Nov. 14, 2012 | baseball parallel coordinates – step by step

Last updated: 2016-09-07

This file is located at this URL: http://unixlab.sfsu.edu/~trogu/523/2016/tutorials/parallel_coordinates/

Although we should be able to import the dataset directly into R, we will write the data directly into the code instead. This will require two things:

1. The numbers from the percent columns need to be written out sequentially, without quotes, and separated by commas. Read the numbers left to right and top to bottom, in sequence.
2. The team names need to be in quotes and separated by commas.
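For comparison, here is a minimal sketch of what the direct import mentioned above could look like. It is not part of the original exercise. It assumes the cleaned worksheet was saved as a CSV file with the columns team, percent_2011 and percent_2012 (the last name is an assumption). To keep the sketch self-contained, it writes a three-team sample file to a temporary location first.

```r
# Hypothetical CSV export of the cleaned worksheet; only three teams shown.
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "team,percent_2011,percent_2012",
  "Philadelphia,0.630,0.500",
  "NY Yankees,0.599,0.586",
  "Milwaukee,0.593,0.512"
), csv_file)

# Read the file into a data frame.
standings <- read.csv(csv_file, stringsAsFactors = FALSE)

# Turn the two percentage columns into a matrix with team names as row names.
baseball_csv <- as.matrix(standings[, c("percent_2011", "percent_2012")])
rownames(baseball_csv) <- standings$team
colnames(baseball_csv) <- c("2011", "2012")
print(baseball_csv)
```

The resulting matrix has the same shape as the one typed in by hand in the rest of this exercise.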

You can cut and paste from Excel into TextWrangler or Notepad++ and then use a simple search-and-replace-all to remove spaces and returns and to add quotes and commas. You should end up with something like this:

For number 1:
0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,0.586,0.543,0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605

For number 2:
"Philadelphia","NY Yankees","Milwaukee","Texas","Detroit","Arizona","Tampa Bay","Boston","St. Louis","Atlanta","LA Angels","San Francisco","LA Dodgers","Toronto","Washington"

Open RStudio and install the “plotrix” package. Use the install packages button in the plot frame in the RStudio environment, as shown at left. Make sure to “check” the package's box after installation to load it (this is the same as running library(plotrix) in the console).

Start a new R script (top-left corner, as shown).

To load the data, paste this into the script:

# baseball - game percentage
# top 15 teams in 2011 (from highest to lowest) and 2012
baseball_top15<-matrix(c(0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,0.586,0.543,
  0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,
  0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605),
  ncol=2,byrow=TRUE)

rownames(baseball_top15)<-c("Philadelphia","NY Yankees","Milwaukee","Texas","Detroit","Arizona","Tampa Bay","Boston","St. Louis","Atlanta","LA Angels","San Francisco","LA Dodgers","Toronto","Washington")

Run the script. The script tells R to build a matrix with two columns (ncol=2) and to fill it row by row (byrow=TRUE). The numbers are therefore read as pairs: for example, the first pair, 0.630,0.500, is Philadelphia's percentage in 2011 and in 2012. The names of the teams are assigned as the names of the rows.
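To see why byrow=TRUE matters, here is a small sketch using only the first three pairs of numbers from the data. The variable names are made up for illustration.

```r
# First three (2011, 2012) pairs from the data above.
values <- c(0.630, 0.500, 0.599, 0.586, 0.593, 0.512)

# Filled row by row: each row is one team's (2011, 2012) pair.
by_row <- matrix(values, ncol = 2, byrow = TRUE)

# Default column-by-column fill: the pairs are split up.
by_col <- matrix(values, ncol = 2)

print(by_row)
print(by_col)
```

Without byrow=TRUE, the first row would pair Philadelphia's 2011 value with a number that belongs to another team.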

colnames(baseball_top15)<-c(2011,2012)

The above names the two columns of the matrix. The bumpchart function uses these names to label the left axis (2011) and the right axis (2012). Nothing is plotted yet.
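Before plotting, it can help to check that the names ended up where you expect. The sketch below uses a two-team matrix (named m here only for illustration) built the same way as baseball_top15.

```r
# Two-team version of the matrix, built like baseball_top15.
m <- matrix(c(0.630, 0.500, 0.599, 0.586), ncol = 2, byrow = TRUE)
rownames(m) <- c("Philadelphia", "NY Yankees")
colnames(m) <- c(2011, 2012)

# Numeric column names are stored as character strings.
print(colnames(m))

# Look up a single value by team and year.
print(m["Philadelphia", "2011"])

# Change from 2011 to 2012 for every team.
print(m[, "2012"] - m[, "2011"])
```

Note that the years have to be quoted when they are used to look up a column by name.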


Next write and run this:

bumpchart(baseball_top15,main="Baseball game percentages - Top 15 teams in 2011")

The bumpchart function in plotrix plots the data but simply ranks each team from highest to lowest on both columns and connects the two points with a line. This is actually quite useful when all that is needed is the comparison between the teams, not the actual percentages. R should plot the graph shown below:

Notice that some teams have the same percentage and share the same dot. However, following each line to the team name on the right side clarifies which line is which.

This is great: teams, rankings, and labels all with one line of code!
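To get a feel for what ranking does to the numbers, here is a sketch with three teams from the data, using base R's rank() on each column. This is an illustration of the idea, not necessarily the exact code inside bumpchart.

```r
# Three teams from the data; Milwaukee and Texas are tied in 2011.
m <- matrix(c(0.593, 0.512,
              0.593, 0.574,
              0.586, 0.543), ncol = 2, byrow = TRUE)
rownames(m) <- c("Milwaukee", "Texas", "Detroit")
colnames(m) <- c("2011", "2012")

# Rank each column separately: 1 is the lowest value, ties get the average rank.
ranks <- apply(m, 2, rank)
print(ranks)
```

The two tied teams get the same rank in 2011, which is consistent with tied teams sharing one dot in the ranked chart.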

With the next commands, we’ll plot the actual percentages and add a scale:

# now show the raw percentages and add ticks on left side (1.1) and right side (1.9)
bumpchart(baseball_top15,rank=FALSE,
  main="Major league baseball - Game percentages - Top 15 teams 2011",col=rainbow(5))

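The rank=FALSE argument plots the actual percentages instead of the ranks. The col=rainbow(5) argument colors the lines with 5 colors taken from the rainbow palette. These are not random: with 15 teams and only 5 colors, the colors are reused. We’ll modify them later in Illustrator. Running the script changes the graph quite radically, but we still need the scale, which we’ll add next.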
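The sketch below shows what rainbow(5) returns and what happens when 5 colors have to cover 15 lines. It assumes the colors are simply repeated in order, which is how R plotting functions usually recycle a short color vector.

```r
# Five evenly spaced hues from the rainbow palette, as hex color strings.
line_cols <- rainbow(5)
print(line_cols)

# Repeat the five colors to cover 15 lines.
recycled <- rep_len(line_cols, 15)

# Under this assumption, lines 1, 6 and 11 share a color.
print(recycled[c(1, 6, 11)])
```

Using rainbow(15) instead would give every team its own color, at the cost of neighboring hues that are harder to tell apart.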


# margins have been reset, so use
par(xpd=TRUE)

boxed.labels(1.1,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
boxed.labels(1.9,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))

par(xpd=FALSE)

The par(xpd=TRUE) command turns off clipping to the plot region, so that labels can also be drawn in the margins, and par(xpd=FALSE) turns the clipping back on afterwards. The boxed.labels command adds the tick mark labels. The 1.1 tells it to place the labels just to the right of the 2011 column on the left, and the 1.9 in the second set of labels tells it to place the labels just to the left of the 2012 column on the right. Using 1 will put the labels exactly on the left column, and using 2 will put the labels exactly on the right column. The by=0.020 tells it to space the labels every 0.020. See the screenshot below for the result:
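The tick positions come from seq(), which can be inspected on its own. The sketch below also formats the values with three decimals using sprintf(). That is an optional extra, not part of the original script. The resulting strings could be passed as the third argument (the labels) of boxed.labels in place of the second seq() call.

```r
# Tick positions from 0.320 to 0.640 in steps of 0.020.
ticks <- seq(0.320, 0.640, by = 0.020)

# Format every tick with exactly three decimals, e.g. "0.500".
tick_labels <- sprintf("%.3f", ticks)

print(length(ticks))
print(tick_labels)
```

Formatting the labels this way keeps the trailing zeros, matching the three-decimal format used in the Excel worksheet.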

Next, export the plot as a PDF file and open it in Illustrator. Before you do anything else, make sure to release any clipping mask, to allow you to edit the file more easily: Select all > Object > Clipping Mask > Release.

This will allow you to select individual objects, such as the dots, which will likely have become squares in the transfer. Remember the bug about the missing font? (You are missing Adobe Pi Std or something like that.) Select all the squares (easier to do in “Preview mode”: View > Preview or Command-Y) and change the font to Zapf Dingbats. This will make them dots (circles) again. Note that every dot is made up of a separate fill and a separate border; you might want to clean that up too.
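The export can also be scripted instead of using the RStudio export button. This is a minimal sketch, assuming the squares come from R's pdf() device drawing small circles with the Dingbats font. The useDingbats=FALSE argument asks the device to draw circles as shapes instead, which may make the font fix in Illustrator unnecessary. The file name is hypothetical, and the plot is a simple stand-in for the bumpchart and boxed.labels calls.

```r
# Hypothetical output file in a temporary folder.
out_file <- file.path(tempdir(), "baseball_top15.pdf")

# Open a PDF device; circles are drawn as shapes, not Dingbats characters.
pdf(out_file, width = 8, height = 10, useDingbats = FALSE)

# Stand-in plot: one line from 2011 to 2012 with filled dots.
# In the exercise, the bumpchart() and boxed.labels() calls would go here.
plot(c(1, 2), c(0.630, 0.500), type = "b", pch = 19, xlab = "", ylab = "")

# Close the device so the file is written to disk.
dev.off()

print(file.exists(out_file))
```

Everything drawn between pdf() and dev.off() ends up in the file, so the par and boxed.labels commands need to be inside that block as well.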


In Illustrator, I deleted all unnecessary boxes. I moved the labels and scales outside the main rectangle, which I made a very light gray. I made all lines except three a dark gray, as well as the dots. I highlighted the best and the worst record from 2011 to 2012 (green=best, red=worst) and made San Francisco blue and thicker. The rest was just fine-tuning. The result is below:
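Part of this styling could be prepared in R before exporting. The sketch below is an alternative to the Illustrator steps, not what was done for the final figure. It assumes that “best” and “worst” refer to the highest and lowest 2012 percentages. The names line_col and line_lwd and the specific colors are made up for illustration.

```r
baseball_top15 <- matrix(c(0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,0.586,0.543,
  0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,
  0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605),
  ncol = 2, byrow = TRUE)
rownames(baseball_top15) <- c("Philadelphia","NY Yankees","Milwaukee","Texas","Detroit",
  "Arizona","Tampa Bay","Boston","St. Louis","Atlanta","LA Angels","San Francisco",
  "LA Dodgers","Toronto","Washington")
colnames(baseball_top15) <- c(2011, 2012)

# Assumption: best and worst are judged on the 2012 column.
best  <- names(which.max(baseball_top15[, "2012"]))
worst <- names(which.min(baseball_top15[, "2012"]))

# Dark gray for every team, then override the three highlighted ones.
line_col <- rep("gray30", nrow(baseball_top15))
names(line_col) <- rownames(baseball_top15)
line_col[best] <- "darkgreen"
line_col[worst] <- "red"
line_col["San Francisco"] <- "blue"

# Thicker line for San Francisco only.
line_lwd <- rep(1, nrow(baseball_top15))
names(line_lwd) <- rownames(baseball_top15)
line_lwd["San Francisco"] <- 3

print(line_col)
```

These vectors could then be tried as the col and lwd arguments of bumpchart in place of col=rainbow(5). Check the plot to confirm that each color lands on the intended team.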

[Figure: Major league baseball - Game percentages - Top 15 teams. Parallel coordinates plot with the 2011 column on the left and the 2012 column on the right, team names (Philadelphia through Washington) on both sides, and a scale from 0.420 to 0.640 in steps of 0.020 on each side.]