
4/09/2017 1

A Talk on Data Analysis and the Choice of Statistical Software

John Xie

Statistics Support Officer, Quantitative Consulting Unit, Research Office, Charles Sturt University, NSW, Australia

Email: [email protected]


Outline:

• Some interesting questions about statistical data analysis

• A t-test and a simple OLM example – using Excel and R

• Treatment of your raw data ready for analysis

• SAS, Splus, SPSS, Stata, Minitab, Matlab, Statistica, and R/RStudio

• My R story and conclusions


Some interesting questions about statistical data analysis:

• Qualitative versus quantitative analysis

• What makes statistical data analysis different from other mathematical modelling

• Point estimation versus interval estimation

• A true model or a useful model

• The difference between a statistician and a non-statistician in data analysis


A big picture about DATA ANALYSIS


Why things get complicated/nasty with statistical data analysis

A number question: 3 + 2 = 5

An equation with one unknown: x + 2 = 5 (x can now be called a ‘variable’)

An equation with two unknowns: x + y = 5 (x and y are two variables)

A mathematical function: y = 5 – x, i.e. y = f(x)

X is a random variable and Y = f(X): Y = 5 – X, X ~ normal(mean = 2, sd = 1)
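
The step from a deterministic function to a function of a random variable can be explored numerically. The following is a minimal sketch in Python (the talk itself works in Excel and R), assuming only the distribution stated on the slide; the seed and the sample size are arbitrary choices made for illustration.

```python
import random
import statistics

# Simulate X ~ normal(mean=2, sd=1) and transform it with Y = 5 - X.
# The seed and sample size are arbitrary choices, not taken from the talk.
rng = random.Random(2017)
n = 10_000

x_draws = [rng.gauss(2, 1) for _ in range(n)]
y_draws = [5 - x for x in x_draws]

# In the deterministic case, x = 2 gives y = 3 exactly. Here Y is described
# by a distribution instead: theory gives Y ~ normal(mean=3, sd=1).
mean_y = statistics.mean(y_draws)
sd_y = statistics.stdev(y_draws)

print(f"sample mean of Y: {mean_y:.3f}")
print(f"sample sd of Y:   {sd_y:.3f}")
```

The sample mean and standard deviation should land close to 3 and 1, but never exactly on them, which is why an answer about Y is naturally reported with an interval rather than a single number.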


A t-test example (paired data): the Excel approach


A t-test example (paired data): the R approach
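
As a rough illustration of what a paired t-test computes, here is a minimal sketch in Python using only the standard library. It is not the Excel or R workflow of the talk. The `before` and `after` values are made up, and the critical value 2.262 is the tabulated two-sided 5% point of Student's t with 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical paired measurements on the same 10 subjects (made-up data).
before = [10, 12, 11, 14, 13, 12, 15, 11, 10, 13]
after = [11, 14, 14, 16, 14, 14, 18, 13, 11, 16]

# A paired t-test is a one-sample t-test on the within-pair differences.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(n)

t_stat = mean_diff / se_diff
df = n - 1

# Two-sided 5% critical value of Student's t with 9 degrees of freedom,
# taken from standard t tables (the standard library has no t distribution).
T_CRIT_DF9 = 2.262

# Interval estimate for the mean difference, alongside the point estimate.
ci_low = mean_diff - T_CRIT_DF9 * se_diff
ci_high = mean_diff + T_CRIT_DF9 * se_diff

print(f"mean difference = {mean_diff:.3f}, t = {t_stat:.3f}, df = {df}")
print(f"95% CI for the mean difference: ({ci_low:.3f}, {ci_high:.3f})")
print("reject H0 at the 5% level:", abs(t_stat) > T_CRIT_DF9)
```

Reporting the confidence interval together with the t statistic gives an interval estimate of the mean difference, not just a point estimate.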


A simple ordinary linear regression model on Biochemical Oxygen Demand (BOD)

Excel output (data columns BOD5 and FM-BOD): fitted trend line y = 0.8135x + 16.308, R² = 0.903

[Figure: Excel scatter plot titled “Association between FM-BOD and BOD5”]
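
The slope, intercept, and R² that a spreadsheet trend line reports come from ordinary least squares. The sketch below shows that calculation in Python under stated assumptions: the `x` and `y` values are hypothetical and are not the BOD measurements from the talk, so the fitted numbers will differ from the Excel output above.

```python
import statistics

# Hypothetical paired readings; these are NOT the BOD data from the talk.
x = [100, 200, 300, 400, 500]
y = [220, 410, 620, 790, 1010]

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)

# Sums of squares and cross-products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
syy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares estimates for the line y = intercept + slope * x.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# For simple linear regression, R^2 is the squared correlation of x and y.
r_squared = sxy ** 2 / (sxx * syy)

print(f"y = {slope:.4f}x + {intercept:.3f}, R^2 = {r_squared:.3f}")
```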


A simple ordinary linear regression model on Biochemical Oxygen Demand (BOD)

[Figure: “Correlation between BOD5 and FM-BOD: mean line band”; x-axis BOD5, y-axis FM_BOD]


[Figure: “Correlation between BOD5 and FM-BOD: prediction band”; x-axis BOD5, y-axis FM_BOD]
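
The difference between a band for the mean line and a prediction band can be made concrete with the textbook formulas for simple linear regression. This is a minimal sketch in Python, not the R code behind the plots. The data are hypothetical, and 3.182 is the tabulated two-sided 5% point of Student's t with 3 degrees of freedom (n − 2 for these five points).

```python
import math

# Hypothetical readings; these are NOT the BOD data from the talk.
x = [100, 200, 300, 400, 500]
y = [220, 410, 620, 790, 1010]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar

# Residual standard error on n - 2 degrees of freedom.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Two-sided 5% critical value of Student's t with 3 degrees of freedom,
# taken from standard t tables.
T_CRIT_DF3 = 3.182


def half_widths(x0):
    """Return the 95% half-widths (mean line, prediction) at x0."""
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    mean_hw = T_CRIT_DF3 * s * math.sqrt(leverage)
    pred_hw = T_CRIT_DF3 * s * math.sqrt(1 + leverage)
    return mean_hw, pred_hw


for x0 in (300, 500):
    fit = intercept + slope * x0
    mean_hw, pred_hw = half_widths(x0)
    print(f"x0 = {x0}: fit = {fit:.1f}, "
          f"mean band = +/- {mean_hw:.1f}, prediction band = +/- {pred_hw:.1f}")
```

The prediction band is always wider because it adds the scatter of a single new observation (the extra 1 under the square root) to the uncertainty in the fitted line, and both bands widen as x0 moves away from the mean of x.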


Data recording and treatment before analysis

• The first row of each worksheet should be the header row.

• Each measurement or variable value should occupy one cell, and one cell only, of the Excel spreadsheet, e.g., one numeric value or a single word in each cell. Do not mix different variable values in one cell.

• An extra column may be added for remarks if necessary.

• The notation ‘NA’ is recommended for missing values or Not Applicable cases.

• It is good practice to keep a metadata page or file that defines or explains the data set.
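
The recording rules above pay off when the worksheet is read by software. The sketch below, in Python with only the standard library, assumes a hypothetical worksheet exported as CSV; the column names `site`, `bod5`, `fm_bod`, and `remarks` are invented for illustration.

```python
import csv
import io

# A hypothetical worksheet exported as CSV: header row first, one value per
# cell, 'NA' for missing values, and a separate column for remarks.
raw = """site,bod5,fm_bod,remarks
A,120,115,
B,NA,310,sample lost
C,450,NA,
"""

# The header row supplies the variable names for every record.
rows = list(csv.DictReader(io.StringIO(raw)))


def to_number(cell):
    """Convert one cell to a float, mapping 'NA' to None (missing)."""
    return None if cell == "NA" else float(cell)


bod5 = [to_number(r["bod5"]) for r in rows]
fm_bod = [to_number(r["fm_bod"]) for r in rows]

# Keep only the records where both measurements are present.
complete = [(b, f) for b, f in zip(bod5, fm_bod)
            if b is not None and f is not None]

print("records read:", len(rows))
print("complete pairs:", complete)
```

Because each cell holds exactly one value and missing values are marked consistently, the conversion needs a single rule. A cell such as "120 (approx)" would break it.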


A typical statistical data analysis procedure:

Raw data and treatment

Exploratory Data Analysis (EDA)

Hypothesis and model specification

Parameter estimation, goodness-of-fit check, and diagnostic check

Analysis result presentation


A statistician will try everything he/she can to provide an interval estimate for his/her answer in data analysis!


“Essentially, all models are wrong, but some are useful.”

--- Box, George E. P.; Norman R. Draper (1987). Empirical Model-Building and Response Surfaces, p. 424, Wiley. ISBN 0471810339.


Type I and Type II errors

                 Disease Status
Screen Status    Disease          No Disease
Test +           √                Type I error
Test -           Type II error    √
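
The two error types in the table can be turned into rates once counts are available. A minimal sketch in Python follows; the counts are hypothetical, and the null hypothesis is taken to be "no disease", so a Type I error is a false positive and a Type II error is a false negative.

```python
# Hypothetical screening counts, laid out like the table above.
true_positive = 90    # disease, test +
false_negative = 10   # disease, test -     (Type II error)
false_positive = 45   # no disease, test +  (Type I error)
true_negative = 855   # no disease, test -

# Type I error rate: share of disease-free people who test positive.
type_i_rate = false_positive / (false_positive + true_negative)

# Type II error rate: share of diseased people who test negative.
type_ii_rate = false_negative / (true_positive + false_negative)

print(f"Type I error rate (false positive rate):  {type_i_rate:.3f}")
print(f"Type II error rate (false negative rate): {type_ii_rate:.3f}")
print(f"Power (1 - Type II error rate):           {1 - type_ii_rate:.3f}")
```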


Avoid the Type III error:

• It is better to have a rough answer to the right question than a precise answer to the wrong question!


Which statistical software could or should I choose for my data analyses?


The software was originally named the Statistical Package for the Social Sciences (SPSS).


• S-PLUS is a commercial implementation of the S programming language sold by TIBCO Software Inc.

• It features object-oriented programming capabilities and advanced analytical algorithms.

(From Wikipedia, the free encyclopedia, accessed May 2014)


STATISTICA in 2014


STATISTICA in 2017


Summary of the statistical packages we have talked about today:

• Excel is not designed for professional statistical data analysis. If you really need Excel and are good at it, use a professional statistical analysis add-in such as XLSTAT.

• Minitab and SPSS are (arguably?) very user-friendly (GUI + help functions).

• SAS has powerful functions, is fast, and handles big data (millions of lines of data entries) well, but it is very expensive and not as easy to use as Minitab or SPSS.

• Splus and R are both variants of the S language; Splus is not free.

• Stata is a powerful statistical package in biostatistics.

• Matlab is an analytic package with powerful statistical functions; it is fast and has excellent program debugging functionality.


Conclusions:

• For a valid and meaningful data analysis, any one of the above-mentioned software tools can do a good job.

• Without a good understanding of both your data and statistics, you are unlikely to conduct a good data analysis, no matter what package you use.

• The selection of a specific analysis package depends on many factors: your knowledge background, your learning objective (e.g., stick to the software that you already know or learn new software from the beginning), cost, etc.

• My personal choice -- R


Why R?

• A large proportion of the world’s leading statisticians use R, and many of the top non-statistician professionals have switched to R.

• More and more people are reporting their research results in the literature using R.

• Its unrivalled coverage and the availability of new, cutting-edge applications in a vast range of fields.

• The quality of the back-up and support available.

• It is free and will be free in the future!


The originators of the R statistical system, Ross Ihaka and Robert Gentleman, and my R story

Photos are downloaded from the following websites: https://www.stat.auckland.ac.nz/~ihaka/

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS0-hyOymXTAyonhiD_VB3BONbYkSGKV8vOe5pujjr6Mf9KBClP

https://www.computerhope.com/people/ross_ihaka.htm


• The R base package and many other special application packages can be downloaded from http://www.R-project.org/, or simply search Google for the keyword CRAN.


R user interfaces: R console, RStudio, or R Commander


The command line is the best tool for statistical data analysis!

• Using the command line requires you to develop a habit of planning, running, and preserving your statistical analyses.

• Operationally, using the command line, if not easier and quicker, is or could be as simple as pointing and clicking on menus.

• The command line is more flexible, adaptive, and powerful (e.g., for relational understanding, for drawing graphs, etc.).

• For the reproducibility of your research results (e.g., for the integrity of peer review), a command-line statistical program makes documentation natural and efficient.

• For people who have only used menu-driven statistics software before, there will be some significant readjustment. However, the future ethical and time-saving advantages are worth the inconvenience.

• Conclusion: use the R console or RStudio if you decide to learn R!
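
The reproducibility point can be illustrated with a small scripted analysis. The sketch below is written in Python rather than R and rests on stated assumptions: the data are hypothetical, `run_analysis` is a name invented here, and the bootstrap percentile interval is just one convenient example of an analysis step that involves randomness.

```python
import random
import statistics


def run_analysis(data, seed=2017, n_boot=2000):
    """Bootstrap a 95% percentile interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed: reruns give identical results
    boot_means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    low = boot_means[n_boot * 25 // 1000]        # 2.5th percentile
    high = boot_means[n_boot * 975 // 1000 - 1]  # 97.5th percentile
    return {"mean": statistics.mean(data), "ci_low": low, "ci_high": high}


# Hypothetical measurements, for illustration only.
data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9]

first_run = run_analysis(data)
second_run = run_analysis(data)

print(first_run)
print("identical on rerun:", first_run == second_run)
```

Because every step, including the random seed, is written down in the script, anyone who has the file and the data can rerun the analysis and obtain the same numbers, which is hard to guarantee with a sequence of menu clicks.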


Get started with R and RStudio

“R and R Studio are separate packages. You will need to install R first.

R is the basic package we are using. R Studio is an add-on that make R easier to use for beginners.”

Quoted from http://www.ics.uci.edu/~jutts/110/InstallingRandRStudio.pdf (accessed 18 August 2017)


Thank you for your attention!

Questions and comments are welcome.

John Xie contact details: Email: [email protected]: +61-2-69332229