4/09/2017 1
A Talk on Data Analysis and the Choice of Statistical Software
John Xie
Statistics Support Officer, Quantitative Consulting Unit, Research Office, Charles Sturt University, NSW, Australia
Email: [email protected]
Outline:
• Some interesting questions about statistical data analysis
• A t-test and a simple OLM example – using Excel and R
• Treatment of your raw data ready for analysis
• SAS, S-PLUS, SPSS, Stata, Minitab, Matlab, Statistica, and R/RStudio
• My R story and conclusions
Some interesting questions about statistical data analysis:
• Qualitative versus quantitative analysis
• What distinguishes statistical data analysis from other mathematical modelling
• Point estimation versus interval estimation
• A true model or a useful model
• The difference between a statistician and a non-statistician in data analysis
A big picture about DATA ANALYSIS
Why things get complicated/nasty with statistical data analysis
A number question: 3 + 2 = 5
An equation with one unknown: x + 2 = 5
An equation with two unknowns: x + y = 5
A mathematical function: y = 5 – x
X is a random variable and Y = f(X): Y = 5 – X , X ~ normal(mean=2, sd = 1)
y = f(x)
x can now be called a ‘variable’
x and y are two variables
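Once X is a random variable, Y = 5 – X no longer has a single answer but a distribution. A minimal sketch of this idea, written in Python purely for illustration (the function name, seed, and sample size are assumptions, not part of the talk): it simulates X ~ normal(mean = 2, sd = 1) and summarises Y, which in theory is normal with mean 3 and sd 1.

```python
import random
import statistics


def simulate_y(n, seed=0):
    """Draw n values of Y = 5 - X, where X ~ normal(mean=2, sd=1)."""
    rng = random.Random(seed)
    return [5 - rng.gauss(2, 1) for _ in range(n)]


ys = simulate_y(10_000)
y_mean = statistics.mean(ys)
y_sd = statistics.stdev(ys)
print(f"mean of Y ~ {y_mean:.2f}, sd of Y ~ {y_sd:.2f}")
```

The simulated mean and sd only approximate the theoretical values, which is exactly why an answer in statistics comes with uncertainty attached.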
A t-test example (paired data): the Excel approach
A t-test example (paired data): the R approach
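Whichever tool is used, a paired t-test is computed from the within-pair differences. The sketch below shows that calculation in Python for illustration only; the data are hypothetical (the talk's own data are not in this transcript) and the function name is an assumption.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    # Standard error of the mean difference.
    se_d = statistics.stdev(diffs) / math.sqrt(n)
    return mean_d / se_d, n - 1


# Hypothetical measurements on the same five subjects (illustration only).
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [11.0, 14.0, 10.0, 13.0, 16.0]
t_stat, df = paired_t(before, after)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The p-value then comes from the t distribution with the stated degrees of freedom. In R the whole test is a single call, `t.test(after, before, paired = TRUE)`.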
A simple ordinary linear regression model on Biochemical Oxygen Demand (BOD)
Excel output (data columns: BOD5, FM-BOD)
[Figure: Association between FM-BOD and BOD5 (Excel chart with fitted trend line)]
Fitted line: y = 0.8135x + 16.308, R² = 0.903
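The slope, intercept, and R² that Excel reports for a trend line come from an ordinary least squares fit. A minimal sketch of that calculation, in Python for illustration only: the function name is an assumption and the data pairs are hypothetical, not the BOD measurements from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # For a simple linear regression, R^2 is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical (x, y) pairs, not the BOD data from the talk.
xs = [100.0, 300.0, 500.0, 900.0, 1200.0]
ys = [110.0, 250.0, 430.0, 760.0, 980.0]
slope, intercept, r_squared = fit_line(xs, ys)
print(f"y = {slope:.4f}x + {intercept:.3f}, R^2 = {r_squared:.3f}")
```

Using the line reported in the Excel output, an x value of 500 gives a fitted y of 0.8135 × 500 + 16.308 = 423.058.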
A simple ordinary linear regression model on Biochemical Oxygen Demand (BOD)
[Figure: Correlation between BOD5 and FM-BOD: mean line band (x-axis: BOD5; y-axis: FM_BOD)]
[Figure: Correlation between BOD5 and FM-BOD: prediction band (x-axis: BOD5; y-axis: FM_BOD)]
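A band for the mean line and a prediction band differ only in one variance term. The standard textbook pointwise forms for a simple linear regression fitted to n points are given below as a reference; the notation is an assumption and the transcript does not show how the plotted bands were actually computed. Here s is the residual standard error, x_0 is the value at which the band is evaluated, and S_xx is the sum of squared deviations of x from its mean.

```latex
\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; s
  \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
  \qquad \text{(band for the mean line)}

\hat{y}_0 \pm t_{1-\alpha/2,\,n-2}\; s
  \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
  \qquad \text{(prediction band)}

S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The extra 1 under the square root accounts for the scatter of a single new observation around the line, so the prediction band is always wider than the band for the mean line.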
Data recording and treatment before analysis
• The first row of each worksheet should be the header row.
• Each measurement or variable value should occupy exactly one cell of the Excel spreadsheet, e.g., one numeric value or a single word per cell. Do not mix different variable values in one cell.
• An extra column may be added for remarks if necessary.
• The notation ‘NA’ is recommended for missing values or Not Applicable cases.
• It is good practice to keep a metadata page or file that defines or explains the data set.
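Data recorded with a header row, one value per cell, and ‘NA’ for missing values can be read by software without manual clean-up. A minimal sketch in Python, for illustration only: the column names and values are hypothetical, and the helper function is an assumption rather than anything shown in the talk.

```python
import csv
import io

# Hypothetical data following the rules: a header row, one value per cell,
# 'NA' for missing values, and a separate remarks column.
raw = """site,bod5,fm_bod,remarks
A,120,110,
B,NA,95,sample lost
C,300,NA,
"""


def read_records(text, numeric_columns):
    """Parse CSV text; convert numeric columns to float and 'NA' to None."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for col in numeric_columns:
            row[col] = None if row[col] == "NA" else float(row[col])
        records.append(row)
    return records


records = read_records(raw, ["bod5", "fm_bod"])
missing = sum(r["bod5"] is None for r in records)
print(f"{len(records)} rows, {missing} missing bod5 value(s)")
```

If a cell mixed a number with a comment (for example "120 approx"), the conversion to a number would fail, which is why remarks belong in their own column.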
A typical statistical data analysis procedure:
Raw data and treatment
Exploratory Data Analysis (EDA)
Hypothesis and model specification
Parameter estimation and goodness-of-fit check and diagnostic check
Analysis result presentation
A statistician will try everything he or she can to provide an interval estimate for the answer in a data analysis!
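To make the contrast between a point estimate and an interval estimate concrete, here is a minimal sketch in Python (for illustration only). The data are hypothetical, the function name is an assumption, and the interval uses a normal quantile as an approximation; for a small sample such as this one, a t quantile would give a somewhat wider and more appropriate interval.

```python
import math
import statistics


def mean_with_interval(values, level=0.95):
    """Return (point estimate, lower, upper) for the mean.

    Uses a normal quantile, which is an approximation; a t quantile
    is more appropriate when the sample is small.
    """
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean, mean - z * se, mean + z * se


# Hypothetical measurements (illustration only).
sample = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4]
estimate, lower, upper = mean_with_interval(sample)
print(f"point estimate {estimate:.2f}, 95% interval ({lower:.2f}, {upper:.2f})")
```

The point estimate alone says nothing about how far it might be from the true mean; the interval carries that information.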
“Essentially, all models are wrong,
but some are useful.”
--- Box, George E. P.; Norman R. Draper (1987).
Empirical Model-Building and Response Surfaces,
p. 424, Wiley. ISBN 0471810339.
Type I and Type II errors
                 Disease Status
Screen Status    Disease          No Disease
Test +           √                Type I error
Test -           Type II error    √
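A Type I error is a positive result when there is nothing to find. As a rough sketch of what a 5% Type I error rate means in a testing context, the Python simulation below (for illustration only; the function name, seed, sample sizes, and the choice of a z-test with known sd are all assumptions) repeatedly tests a null hypothesis that is true and counts how often it is rejected.

```python
import math
import random
import statistics


def type_i_error_rate(n_trials=2000, n=30, alpha=0.05, seed=1):
    """Estimate how often a two-sided z-test rejects a true null hypothesis."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_trials):
        # The null is true here: data come from normal(0, 1) and we test mean = 0.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = statistics.mean(sample) * math.sqrt(n)  # sd is known to be 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_trials


rate = type_i_error_rate()
print(f"estimated Type I error rate: {rate:.3f}")
```

The estimated rate should be close to the chosen alpha, since every rejection in this simulation is a false positive.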
Avoid the Type III error:
• It is better to have a rough answer to the right question than a precise answer to the wrong question!
Which statistical software could I, or should I, choose for my data analyses?
The software was originally named the Statistical Package for the Social Sciences (SPSS).
• S-PLUS is a commercial implementation of the S programming language sold by TIBCO Software Inc.
• It features object-oriented programming capabilities and advanced analytical algorithms.
(From Wikipedia, the free encyclopedia, accessed May 2014)
STATISTICA in 2014
STATISTICA in 2017
Summary of the statistical packages we have talked about today:
• Excel is not for professional statistical data analysis. If you really need Excel and are good at it, use a professional statistical analysis add-in such as XLSTAT
• Minitab and SPSS are (arguably?) very user friendly (GUI + help functions)
• SAS has powerful functions, is fast, and handles big data (millions of rows) well, but it is very expensive and not as easy to use as Minitab or SPSS
• S-PLUS and R are both variants of the S language; S-PLUS is not free
• Stata is a powerful statistical package in biostatistics
• Matlab is an analytic package with powerful statistical functions; it is fast and has excellent program debugging functionality
Conclusions:
• For a valid and meaningful data analysis, any one of the above-mentioned software tools can do a good job
• Without a good understanding of both your data and statistics, you are unlikely to conduct a good data analysis, no matter which package you use
• The selection of a specific analysis package depends on many factors: your background knowledge, your learning objective (e.g., stick to the software that you already know or learn new software from the beginning), cost, etc.
• My personal choice: R
Why R?
• A large proportion of the world’s leading statisticians use R, and many of the top non-statistician professionals have switched to R.
• More and more people are reporting research results obtained with R in the literature.
• Its unrivalled coverage and the availability of new, cutting-edge applications in a vast range of fields.
• The quality of the back-up and support available.
• It is free and will be free in the future!
The originators of the R statistical system, Ross Ihaka and Robert Gentleman, and my R story
Photos were downloaded from the following websites:
https://www.stat.auckland.ac.nz/~ihaka/
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS0-hyOymXTAyonhiD_VB3BONbYkSGKV8vOe5pujjr6Mf9KBClP
https://www.computerhope.com/people/ross_ihaka.htm
• The R base package and many other special application packages can be downloaded from http://www.R-project.org/, or simply search Google for the keyword CRAN
R user interfaces: the R console, RStudio, or R Commander
The command line is the best tool for statistical data analysis!
• Using the command line requires you to develop a habit of planning, running, and preserving your statistical analyses.
• Operationally, using the command line can be as simple as pointing and clicking on menus, if not easier and quicker.
• The command line is more flexible, adaptive, and powerful (e.g., for relational understanding, for drawing graphs, etc.)
• For the reproducibility of your research results (e.g., for the integrity of peer review), a command line statistical program makes documentation natural and efficient.
• For people who have only used menu-driven statistics software before, there will be some significant readjustment. However, the future ethical and time-saving advantages are worth the inconvenience.
• Conclusion: use the R console or RStudio if you decide to learn R!
Get started with R and RStudio
“R and R Studio are separate packages. You will need to install R first.
R is the basic package we are using. R Studio is an add-on that make R easier to use for beginners.”
A quote from (accessed 18 August 2017):
http://www.ics.uci.edu/~jutts/110/InstallingRandRStudio.pdf
Thank you for your attention!
Questions and comments are welcome.
John Xie contact details: Email: [email protected]: +61-2-69332229