r: statistics for all of us - personality projectpersonality-project.org/r/why.r.pdf · r:...

R: statistics for all of usR: an international statistical collaboratory

Prepared for part of the symposium onMultivariate Statistical Methods in Individual Differences ResearchInternational Society for the Study of Individual Differences Biennial meeting, Adelaide, July , 2005

William Revelle, Northwestern Universitypersonality-project.org/r/

R: statistics for all of us

• What is it?

• Why use it?

• Common (mis)perceptions

• Examples for personality and individual differences research

R: What is it?

•R: An international collaboration

•R: the open source - public domain version of S+

•R: written by statisticians (and all of us) for statisticians (and the rest of us)

•R: an extensible language

Common statistical programs

General SpecializedR AMOSS+ EQSSAS LISRELSPSS Plus

STATA MxSYSTAT your favorite program.

Common statistical programsmost are costly

General SpecializedR AMO$$+ EQ$

$A$ LI$REL$P$$ Maple$

$TATA Mx$Y$TAT your favorite program.

R: a way of thinking(from the R point of view)

• “R is the lingua franca of statistical research. Work in all other languages should be discouraged.”

• “This is R. There is no if. Only how.”

• “Overall, SAS is about 11 years behind R and S-Plus in statistical capabilities (last year it was about 10 years behind) in my estimation.”

Taken from the R.-fortunes (selections from the R.-help list serve)

But it is open source - how can you trust it?

• Q: When you use it [R], since it is written by so many authors, how do you know that the results are trustable?

• A:The R engine [...] is pretty well uniformly excellent code but you have to take my word for that. Actually, you don't. The whole engine is open source so, if you wish, you can check every line of it. If people were out to push dodgy software, this is not the way they'd go about it.

Taken from the R.-fortunes (selections from the R.-help list serve)

What is R? : Technically

• R is an open source implementation of S (S-Plus is a commercial implementation)

• R is available under GNU Copy-left

• The current version of R is 2.1.1

• R is group project run by a core group of developers (with new releases semiannually)

• (Adapted from Robert Gentleman)

R: History• 1991-93: Ross Dhaka and Robert Gentleman begin

work on R project at U. Auckland

• 1995: R available by ftp under the GPL

• 96-97: mailing list and R core group is formed

• 2000: John Chambers, designer of S joins the R core (wins a prize for best software from ACM for S)

• 2001-2005: Core team continues to improve base package

• Many (>400) others contribute “packages”

Why R?

• Graphics for data exploration and interpretation

• Data manipulation including statistics as data

• Statistical analysis

• Standard univariate and multivariate generalizations of the linear model

• Multivariate-structural extensions

Why R? Graphics

• Sample graphics taken from

• http:personality-project.org/r/

• showing what can be done by an amateur

• http://addictedtor.free.fr/graphiques/

• showing some most impressive graphs

m-1.0-1.0M-1.0M

-0.5-0.5M-0.5M

0.00.0M0.0M

0.50.5M0.5M

1.01.0M1.0M

m-1.0-1.0M-1.0M-1

.0-0.5-0.5M-0.5M

0.00.0M0.0M0.0

0.50.5M0.5M0.5

1.01.0M1.0M1.0

factor 1Mfactor 1M

factor 1

factor 2Mfactor 2M

Two dimensions of affectMTwo dimensions of affectM

Two dimensions of affect

sluggishMsluggishM

sluggish

depressedMdepressedM

depressed

satisfiedMsatisfiedM

satisfied

relaxedMrelaxedM

relaxed

blueMblueM

intenseMintenseM

intense

scaredMscaredM

scared

enthusiasticMenthusiasticM

enthusiastic

sadMsadM

full_of_pepMfull_of_pepM

full_of_pep

unhappyMunhappyM

unhappy

livelyMlivelyM

lively

tiredMtiredM

dullMdullM

at_easeMat_easeM

at_ease

guiltyMguiltyM

guilty

excitedMexcitedM

excited

gloomyMgloomyM

gloomy

nervousMnervousM

nervous

grouchyMgrouchyM

grouchy

calmMcalmM

drowsyMdrowsyM

drowsy

alertMalertM

surprisedMsurprisedM

surprised

energeticMenergeticM

energetic

sorryMsorryM

sleepyMsleepyM

sleepy

pleasedMpleasedM

pleased

happyMhappyM

distressedMdistressedM

distressed

contentMcontentM

content

tenseMtenseM

ashamedMashamedM

ashamed

anxiousManxiousM

anxious

cheerfulMcheerfulM

cheerful

Standard Plots of factor loadings

m-0.4-0.4M-0.4M

-0.2-0.2M-0.2M

0.00.0M0.0M

0.20.2M0.2M

0.40.4M0.4M

0.60.6M0.6M

0.80.8M0.8M

1.01.0M1.0M

m-0.4-0.4M-0.4M-0

.4-0.2-0.2M-0.2M

0.00.0M0.0M0

.00.20.2M0.2M

0.40.4M0.4M0

.40.60.6M0.6M

0.80.8M0.8M0

.81.01.0M1.0M

correlations with happyMcorrelations with happyM

correlations with happy

correlations with sadMcorrelations with sadM

correlations with happy and sadMcorrelations with happy and sadM

correlations with happy and sad

sluggishMsluggishM

sluggish

depressedMdepressedM

depressed

satisfiedMsatisfiedM

satisfied

relaxedMrelaxedM

relaxed

blueMblueM

intenseMintenseM

intense

scaredMscaredM

scared

sadMsadM

full_of_pepMfull_of_pepM

full_of_pep

unhappyMunhappyM

unhappy

livelyMlivelyM

lively

tiredMtiredM

dullMdullM

at_easeMat_easeM

at_ease

guiltyMguiltyM

guilty

excitedMexcitedM

excited

gloomyMgloomyM

gloomy

nervousMnervousM

nervous

grouchyMgrouchyM

grouchy

calmMcalmM

drowsyMdrowsyM

drowsy

alertMalertM

surprisedMsurprisedM

surprised

energeticMenergeticM

energetic

sorryMsorryM

sleepyMsleepyM

sleepy

pleasedMpleasedM

pleased

happyMhappyM

distressedMdistressedM

distressed

contentMcontentM

content

tenseMtenseM

ashamedMashamedM

ashamed

anxiousManxiousM

anxious

cheerfulMcheerfulM

cheerful

Data points can be dynamically identified

/Users/Bill/Users/Bill

/Users/Bill

0.0 0.5 1.0

r with 0 degrees

Simulated correlations 90 degrees

/Users/Bill

-1.0 -0.5 0.0 0.5 1.0

r with happy

Observed correlations with happy and nervous

/Users/Bill

-1.0 -0.5 0.0 0.5 1.0

r with 0 degrees

/Users/Bill

-1.0 -0.5 0.0 0.5 1.0

r with happy

Observed correlations with happy and sad

-1.0 -0.5 0.0 0.5 1.0

r with 0 degrees

/Users/Bill

-1.0 -0.5 0.0 0.5 1.0

r with active

ith inactive

Observed correlations with active and inactive

Multi-panel graphs can be labeled separately and organized vertically or

horizontally

Simulated data can be generated to fit normal,

rectangular, binomial, poissen, exponential, etc.

distributions

0 2 4 6 8 10 12

0.85 0.80

0 1 2 3 4 5 6 7

-0.180

0.43 -0.05 -0.22

epiImp

epilie

5 10 15 20 0

2 4 6 8

0 5 10 15 20

20epiNeur

Scatter Plot Matrices can show smoothed fits

0 2 4 6 8 10 12

0.85 0.80

0 1 2 3 4 5 6 7

0.43 -0.052 -0.22

epiImp

-0.074

epilie

5 10 15 20 0 2 4 6 8

0 5 10 15 20

20epiNeur

Can scale font size of correlations by absolute size of r

−1.0 −0.5 0.0 0.5 1.0 1.5

−1.0

−0.5

Movie Study 1

Positive Affect

Threat Neutral

−1.0 −0.5 0.0 0.5 1.0 1.5

−1.0

−0.5

Movie Study 2

Positive Affect

Threat

Neutral

Error bars on two dimensions

−1.0 −0.5 0.0 0.5 1.0 1.5

−1.0

−0.5

PA x NA

Positive Affect

ct Sad

ThreatNeutral

−1.0 −0.5 0.0 0.5 1.0 1.5

−1.0

−0.5

PA x NA

Positive Affect

Threat

Neutral

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.5

−0.5

PA x EA

Postive Affect

ThreatNeutral

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.5

−0.5

PA x EA

Postive Affect

Threat

Neutral

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.5

−0.5

NA x TA

Negative Affect

l SadThreat

NeutralHappy

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.5

−0.5

NA x TA

Negative Affect

l SadThreat

Neutral

Study 2 Study 3

Presentation graphicsscale to fit page

and produce output for pdf presentations

Window or page size is controllable

Graphic output can be taken from multiple programs (e.g. Very Simple Structure, factor analysis

5 10 15

Anscombe's 4 Regression data sets

Built-in data sets provide useful demonstrations of stats

Mapping data (GIS) may be combined with descriptive data

Many GIS map files are available for download from ESRI

GIS maps detail regional boundaries

US counties

Counties of California

Median Household Income for 2000

2000 median household income for zip codes

Median Household Income for 2000

Combine income data from 2000 census with zipcode map of San Diego County

100 105 110 115 120 125 130

Borneo and its surrounds

longitude

latitude

GIS files of Borneo can show country boundaries (e.g., Malaysia, Brunei, Indonesia)

100 105 110 115 120 125 130

Borneo and its surrounds

longitude

latitude

Public access GIS files include rivers and roads

Even more graphics

• Taken from a collection of R demonstrations and graphics

• http://addictedtor.free.fr/graphiques/

c(-1, 1)

PCA 5 varsprincomp(x = data, cor = cor)

Fertility

Agriculture

Examination

Education Catholic

(1-3) 60%

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5

c(-1, 1)

Clustering 4 groups

80 60 40 20 0

48Groups

V. De Geneve

c(-1, 1)

Factor 1 [41%]

c(-1, 1)

Factor 3 [19%]

Principal components and clustering of sources of variance in USA arrest data

-2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0

Mixture models

2 3 4 5 6 7 8 9 10

Notched Boxplots

Notched Boxplots show confidence regions

Scatter Plot Matrix

Length

setosa

Length

versicolor

Length

virginica

Varieties

Multipanel graphs

Height (inches)

60 65 70 75

Bass 2 Bass 1

60 65 70 75

Tenor 2

Tenor 1

Alto 2

60 65 70 75

Alto 1

Soprano 2

60 65 70 75

Soprano 1

Histograms and fitted distributions

-3 -2 -1 0 1 2 3

Combine scatter plot with histograms

Why R?

Data ManipulationData Entry

• from console

• from clipboard (copied from other programs)

• from file (text files, csv, SPSS, Excel, MySQL)

• from the web

Data Manipulation: Data Structures

• Data types: integer, real, logical, character, string

• Vectors of any data type

• Matrices of any data type

• Data Frames (similar to matrix of mixed type)

• Lists of any mixture of types

• All operations are functions and the returned values may be used in any data structure (e.g., as an element of a data frame or of a list)

Data Manipulation

• standard arithmetic and logical operations

• matrix operations including transpose, inner product, outer product, diagonal, trace, invert

• searching, sorting, merging

• data cleaning by logical commands

Why R?

Standard Statistical packages in R

• Descriptive and exploratory statistics

• The general linear model and its special cases

• t-test, ANOVA, MANOVA, regression, logistic regression, cox models, etc.

• multilevel models (mixed models, hierarchical models)

• time series, econometrics

• circular statistics, environmental-geographical statistics

Psychometric packages• Structural Equation Modeling

• Rasch Modeling of one parameter IRT

• Factor and Principal Component Analysis

• Rotations (Orthogonal: Varimax, Oblique: quartimax, quartimin, Promax, etc.)

• singular value decomposition

• eigen vector - eigen value decomposition

• Multidimensional scaling

R is extensible

• Functions can be defined easily and then stored for later use.

• Packages of functions are written by “all of us” to solve particular problems

• e.g. sem, rasch, rotation, lattice graphics

• Packages can be developed and tailored for a specific lab or problem area, or can be shared through the CRAN repository of (currently) > 400 packages

Programming in R-- personal examples

• Very Simple Structure

• 6 weeks in Fortran for a mainframe

• 3 weeks in Pascal for a Mac

• 2 days in R for Mac/Unix/Windows

• Schmid Leiman decomposition and estimation of coefficient omega

• 1 day

Popular misconceptions

• R is hard to learn

• R is not user friendly

• I can’t figure R out

• R is free -- it can’t be very good

Popular Misconceptions• R is hard to learn

• With a brief web based tutorial, 2nd and 3rd year undergraduates in psychological methods and personality research courses were using R for descriptive and inferential statistics and producing publication quality graphics

• R is easy to learn, hard to master

• R-help newsgroup is very supportive

• Multiple web based and pdf tutorials

Popular Misconceptions• R is not user friendly

• Does not have a GUI, but those are being developed

• Help and examples are embedded within program,

• ? function produces help with examples for any function

• R-help newsgroup is very supportive

Popular Misconceptions

• R is free, it can not be any good

• Developed by some of the best statisticians around

• vetted by all users, bugs when they exist, are rapidly fixed

• R community provides excellent and timely statistical and programming help (if somewhat acerbic to those who have not done their homework)

R: an international collaboratory

• Most appropriate for an International Society to be aware of the power of international collaboration to produce cutting edge software

R: statistics for all of usR: an international statistical collaboratory

Prepared for part of the symposium onMultivariate Statistical Methods in Individual Differences ResearchInternational Society for the Study of Individual Differences Biennial meeting, Adelaide, July , 2005

William Revelle, Northwestern Universitypersonality-project.org/r/personality-project.org/r.short.html

r: statistics for all of us - personality projectpersonality-project.org/r/why.r.pdf · r:...

Documents

efficient statistics in r

introduction to statistics with r

3 descriptive statistics with r

introduction to r and statistics

statistics environment r - lima-cityexample: hans w....

advanced statistics using r, asur - evolutionary biology ·...

un statistics division economic statistics branch national...

correlation and regression: example - the personality...

introduction to bio statistics 2nd edition r. sokal f. rohlf...

r statistics

probability and basic statistics with r

using r and the psych package to nd w - personality...

paswrprobability and statistics with r

statistics with r - Åbo akademi...

r freeware statistics package

discovering statistics using r - bibliothekdiscovering...

call center statistics with r

r statistics with mongo db

descriptive statistics and inferential statistics shibin liu...

exploratory statistics with r