r: statistics for all of us - personality projectpersonality-project.org/r/why.r.pdf · r:...

Post on 01-Apr-2020

13 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

R: statistics for all of usR: an international statistical collaboratory

Prepared for part of the symposium onMultivariate Statistical Methods in Individual Differences ResearchInternational Society for the Study of Individual Differences Biennial meeting, Adelaide, July , 2005

William Revelle, Northwestern Universitypersonality-project.org/r/

R: statistics for all of us

• What is it?

• Why use it?

• Common (mis)perceptions

• Examples for personality and individual differences research

R: What is it?

•R: An international collaboration

•R: the open source - public domain version of S+

•R: written by statisticians (and all of us) for statisticians (and the rest of us)

•R: an extensible language

Common statistical programs

General SpecializedR AMOSS+ EQSSAS LISRELSPSS Plus

STATA MxSYSTAT your favorite program.

Common statistical programsmost are costly

General SpecializedR AMO$$+ EQ$

$A$ LI$REL$P$$ Maple$

$TATA Mx$Y$TAT your favorite program.

R: a way of thinking(from the R point of view)

• “R is the lingua franca of statistical research. Work in all other languages should be discouraged.”

• “This is R. There is no if. Only how.”

• “Overall, SAS is about 11 years behind R and S-Plus in statistical capabilities (last year it was about 10 years behind) in my estimation.”

Taken from the R.-fortunes (selections from the R.-help list serve)

But it is open source - how can you trust it?

• Q: When you use it [R], since it is written by so many authors, how do you know that the results are trustable?

• A:The R engine [...] is pretty well uniformly excellent code but you have to take my word for that. Actually, you don't. The whole engine is open source so, if you wish, you can check every line of it. If people were out to push dodgy software, this is not the way they'd go about it.

Taken from the R.-fortunes (selections from the R.-help list serve)

What is R? : Technically

• R is an open source implementation of S (S-Plus is a commercial implementation)

• R is available under GNU Copy-left

• The current version of R is 2.1.1

• R is group project run by a core group of developers (with new releases semiannually)

• (Adapted from Robert Gentleman)

R: History• 1991-93: Ross Dhaka and Robert Gentleman begin

work on R project at U. Auckland

• 1995: R available by ftp under the GPL

• 96-97: mailing list and R core group is formed

• 2000: John Chambers, designer of S joins the R core (wins a prize for best software from ACM for S)

• 2001-2005: Core team continues to improve base package

• Many (>400) others contribute “packages”

Why R?

• Graphics for data exploration and interpretation

• Data manipulation including statistics as data

• Statistical analysis

• Standard univariate and multivariate generalizations of the linear model

• Multivariate-structural extensions

Why R? Graphics

• Sample graphics taken from

• http:personality-project.org/r/

• showing what can be done by an amateur

• http://addictedtor.free.fr/graphiques/

• showing some most impressive graphs

m-1.0-1.0M-1.0M

-1.0

-0.5-0.5M-0.5M

-0.5

0.00.0M0.0M

0.0

0.50.5M0.5M

0.5

1.01.0M1.0M

1.0

m-1.0-1.0M-1.0M-1

.0-0.5-0.5M-0.5M

-0.5

0.00.0M0.0M0.0

0.50.5M0.5M0.5

1.01.0M1.0M1.0

factor 1Mfactor 1M

factor 1

factor 2Mfactor 2M

facto

r 2

Two dimensions of affectMTwo dimensions of affectM

Two dimensions of affect

sluggishMsluggishM

sluggish

depressedMdepressedM

depressed

satisfiedMsatisfiedM

satisfied

relaxedMrelaxedM

relaxed

blueMblueM

blue

intenseMintenseM

intense

scaredMscaredM

scared

enthusiasticMenthusiasticM

enthusiastic

sadMsadM

sad

full_of_pepMfull_of_pepM

full_of_pep

unhappyMunhappyM

unhappy

livelyMlivelyM

lively

tiredMtiredM

tired

dullMdullM

dull

at_easeMat_easeM

at_ease

M M

guiltyMguiltyM

guilty

excitedMexcitedM

excited

gloomyMgloomyM

gloomy

nervousMnervousM

nervous

grouchyMgrouchyM

grouchy

calmMcalmM

calm

drowsyMdrowsyM

drowsy

alertMalertM

alert

surprisedMsurprisedM

surprised

energeticMenergeticM

energetic

sorryMsorryM

sorry

sleepyMsleepyM

sleepy

pleasedMpleasedM

pleased

happyMhappyM

happy

distressedMdistressedM

distressed

contentMcontentM

content

tenseMtenseM

tense

ashamedMashamedM

ashamed

anxiousManxiousM

anxious

cheerfulMcheerfulM

cheerful

Standard Plots of factor loadings

m-0.4-0.4M-0.4M

-0.4

-0.2-0.2M-0.2M

-0.2

0.00.0M0.0M

0.0

0.20.2M0.2M

0.2

0.40.4M0.4M

0.4

0.60.6M0.6M

0.6

0.80.8M0.8M

0.8

1.01.0M1.0M

1.0

m-0.4-0.4M-0.4M-0

.4-0.2-0.2M-0.2M

-0.2

0.00.0M0.0M0

.00.20.2M0.2M

0.2

0.40.4M0.4M0

.40.60.6M0.6M

0.6

0.80.8M0.8M0

.81.01.0M1.0M

1.0

correlations with happyMcorrelations with happyM

correlations with happy

correlations with sadMcorrelations with sadM

co

rre

latio

ns w

ith

sa

d

correlations with happy and sadMcorrelations with happy and sadM

correlations with happy and sad

sluggishMsluggishM

sluggish

depressedMdepressedM

depressed

satisfiedMsatisfiedM

satisfied

relaxedMrelaxedM

relaxed

blueMblueM

blue

intenseMintenseM

intense

scaredMscaredM

scared

sadMsadM

sad

full_of_pepMfull_of_pepM

full_of_pep

unhappyMunhappyM

unhappy

livelyMlivelyM

lively

tiredMtiredM

tired

dullMdullM

dull

at_easeMat_easeM

at_ease

M M

guiltyMguiltyM

guilty

excitedMexcitedM

excited

gloomyMgloomyM

gloomy

nervousMnervousM

nervous

grouchyMgrouchyM

grouchy

calmMcalmM

calm

drowsyMdrowsyM

drowsy

alertMalertM

alert

surprisedMsurprisedM

surprised

energeticMenergeticM

energetic

sorryMsorryM

sorry

sleepyMsleepyM

sleepy

pleasedMpleasedM

pleased

happyMhappyM

happy

distressedMdistressedM

distressed

contentMcontentM

content

tenseMtenseM

tense

ashamedMashamedM

ashamed

anxiousManxiousM

anxious

cheerfulMcheerfulM

cheerful

Data points can be dynamically identified

-1.0

/Users/Bill/Users/Bill

-0.5

/Users/Bill

0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

r with 0 degrees

r w

ith 9

0 d

egre

es

Simulated correlations 90 degrees

/Users/Bill

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

r with happy

r w

ith n

erv

ous

Observed correlations with happy and nervous

/Users/Bill

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

r with 0 degrees

r w

ith 1

20 d

egre

es

Simulated correlations 120 degrees

/Users/Bill

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

r with happy

r w

ith s

ad

Observed correlations with happy and sad

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

r with 0 degrees

r w

ith 1

80 d

egre

es

/Users/Bill

Simulated correlations 180 degrees

/Users/Bill

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

r with active

r w

ith inactive

Observed correlations with active and inactive

Multi-panel graphs can be labeled separately and organized vertically or

horizontally

Simulated data can be generated to fit normal,

rectangular, binomial, poissen, exponential, etc.

distributions

epiE

0 2 4 6 8 10 12

0.85 0.80

0 1 2 3 4 5 6 7

-0.22

510

15

20

-0.180

24

68

10

12

~

epiS

0.43 -0.05 -0.22

~

epiImp

-0.24

02

46

8

-0.07

01

23

45

67

~

epilie

-0.25

5 10 15 20 0

~

2 4 6 8

~

0 5 10 15 20

05

10

15

20epiNeur

Scatter Plot Matrices can show smoothed fits

epiE

0 2 4 6 8 10 12

0.85 0.80

~

0 1 2 3 4 5 6 7

-0.22

510

15

20

-0.18

02

46

810

12

epiS

0.43 -0.052 -0.22

~

epiImp

-0.24

02

46

8

-0.074

01

23

45

67

~

epilie

-0.25

5 10 15 20 0 2 4 6 8

~

0 5 10 15 20

05

10

15

20epiNeur

Can scale font size of correlations by absolute size of r

−1.0 −0.5 0.0 0.5 1.0 1.5

−1.0

−0.5

0.0

0.5

1.0

1.5

Movie Study 1

Positive Affect

Nega

tive

Affe

ct

Sad

Threat Neutral

Happy

−1.0 −0.5 0.0 0.5 1.0 1.5

−1.0

−0.5

0.0

0.5

1.0

1.5

Movie Study 2

Positive Affect

Nega

tive

Affe

ct

Sad

Threat

Neutral

Happy

Error bars on two dimensions

−1.0 −0.5 0.0 0.5 1.0 1.5

−1.0

−0.5

0.0

0.5

1.0

1.5

PA x NA

Positive Affect

Nega

tive

Affe

ct Sad

ThreatNeutral

Happy

−1.0 −0.5 0.0 0.5 1.0 1.5

−1.0

−0.5

0.0

0.5

1.0

1.5

PA x NA

Positive Affect

Nega

tive

Affe

ct

Sad

Threat

Neutral

Happy

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.5

−0.5

0.5

1.0

1.5

PA x EA

Postive Affect

Ener

getic

Aro

usal

Sad

ThreatNeutral

Happy

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.5

−0.5

0.5

1.0

1.5

PA x EA

Postive Affect

Ener

getic

Aro

usal

Sad

Threat

Neutral

Happy

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.5

−0.5

0.5

1.0

1.5

NA x TA

Negative Affect

Tens

e Ar

ousa

l SadThreat

NeutralHappy

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5

−1.5

−0.5

0.5

1.0

1.5

NA x TA

Negative Affect

Tens

e Ar

ousa

l SadThreat

Neutral

Happy

Study 2 Study 3

Presentation graphicsscale to fit page

and produce output for pdf presentations

Window or page size is controllable

Graphic output can be taken from multiple programs (e.g. Very Simple Structure, factor analysis

5 10 15

46

81

01

2

x1

y1

5 10 15

46

81

01

2

x2

y2

5 10 15

46

81

01

2

x3

y3

5 10 15

46

81

01

2

~

x4

y4

Anscombe's 4 Regression data sets

Built-in data sets provide useful demonstrations of stats

Mapping data (GIS) may be combined with descriptive data

Many GIS map files are available for download from ESRI

GIS maps detail regional boundaries

US counties

Counties of California

~

Median Household Income for 2000

2000 median household income for zip codes

~

Median Household Income for 2000

Combine income data from 2000 census with zipcode map of San Diego County

100 105 110 115 120 125 130

-50

510

Borneo and its surrounds

longitude

latitude

GIS files of Borneo can show country boundaries (e.g., Malaysia, Brunei, Indonesia)

100 105 110 115 120 125 130

-50

510

Borneo and its surrounds

longitude

latitude

Public access GIS files include rivers and roads

Even more graphics

• Taken from a collection of R demonstrations and graphics

• http://addictedtor.free.fr/graphiques/

c(-1, 1)

~

PCA 5 varsprincomp(x = data, cor = cor)

x1

Fertility

Agriculture

Examination

Education Catholic

(1-3) 60%

Comp.1 Comp.2 Comp.3 Comp.4 Comp.5

0.0

0.5

1.0

1.5

c(-1, 1)

Clustering 4 groups

80 60 40 20 0

48Groups

2

1

16

28

~

V. De Geneve

c(-1, 1)

c(0

, 1)

Factor 1 [41%]

c(-1, 1)

c(0

, 1)

Factor 3 [19%]

Principal components and clustering of sources of variance in USA arrest data

~

-2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0

0.0

0.2

0.4

0.6

Mixture models

1

~

2 3 4 5 6 7 8 9 10

0

2

4

6

Notched Boxplots

Notched Boxplots show confidence regions

~

Scatter Plot Matrix

Sepal

Length

Sepal

Width

~

Petal

Length

setosa

Sepal

Length

Sepal

Width

~

Petal

Length

versicolor

Sepal

Length

~

Sepal

Width

Petal

Length

virginica

Three

Varieties

of

Iris

Multipanel graphs

~

Height (inches)

De

nsity

60 65 70 75

0.00

0.05

0.10

0.15

0.20

~

Bass 2 Bass 1

60 65 70 75

Tenor 2

~

Tenor 1

Alto 2

60 65 70 75

Alto 1

~

Soprano 2

60 65 70 75

0.00

0.05

0.10

0.15

0.20

Soprano 1

Histograms and fitted distributions

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

Combine scatter plot with histograms

Why R?

• Graphics for data exploration and interpretation

• Data manipulation including statistics as data

• Statistical analysis

• Standard univariate and multivariate generalizations of the linear model

• Multivariate-structural extensions

Data ManipulationData Entry

• from console

• from clipboard (copied from other programs)

• from file (text files, csv, SPSS, Excel, MySQL)

• from the web

Data Manipulation: Data Structures

• Data types: integer, real, logical, character, string

• Vectors of any data type

• Matrices of any data type

• Data Frames (similar to matrix of mixed type)

• Lists of any mixture of types

• All operations are functions and the returned values may be used in any data structure (e.g., as an element of a data frame or of a list)

Data Manipulation

• standard arithmetic and logical operations

• matrix operations including transpose, inner product, outer product, diagonal, trace, invert

• searching, sorting, merging

• data cleaning by logical commands

Why R?

• Graphics for data exploration and interpretation

• Data manipulation including statistics as data

• Statistical analysis

• Standard univariate and multivariate generalizations of the linear model

• Multivariate-structural extensions

Standard Statistical packages in R

• Descriptive and exploratory statistics

• The general linear model and its special cases

• t-test, ANOVA, MANOVA, regression, logistic regression, cox models, etc.

• multilevel models (mixed models, hierarchical models)

• time series, econometrics

• circular statistics, environmental-geographical statistics

Psychometric packages• Structural Equation Modeling

• Rasch Modeling of one parameter IRT

• Factor and Principal Component Analysis

• Rotations (Orthogonal: Varimax, Oblique: quartimax, quartimin, Promax, etc.)

• singular value decomposition

• eigen vector - eigen value decomposition

• Multidimensional scaling

R is extensible

• Functions can be defined easily and then stored for later use.

• Packages of functions are written by “all of us” to solve particular problems

• e.g. sem, rasch, rotation, lattice graphics

• Packages can be developed and tailored for a specific lab or problem area, or can be shared through the CRAN repository of (currently) > 400 packages

Programming in R-- personal examples

• Very Simple Structure

• 6 weeks in Fortran for a mainframe

• 3 weeks in Pascal for a Mac

• 2 days in R for Mac/Unix/Windows

• Schmid Leiman decomposition and estimation of coefficient omega

• 1 day

Popular misconceptions

• R is hard to learn

• R is not user friendly

• I can’t figure R out

• R is free -- it can’t be very good

Popular Misconceptions• R is hard to learn

• With a brief web based tutorial, 2nd and 3rd year undergraduates in psychological methods and personality research courses were using R for descriptive and inferential statistics and producing publication quality graphics

• R is easy to learn, hard to master

• R-help newsgroup is very supportive

• Multiple web based and pdf tutorials

Popular Misconceptions• R is not user friendly

• Does not have a GUI, but those are being developed

• Help and examples are embedded within program,

• ? function produces help with examples for any function

• R-help newsgroup is very supportive

Popular Misconceptions

• R is free, it can not be any good

• Developed by some of the best statisticians around

• vetted by all users, bugs when they exist, are rapidly fixed

• R community provides excellent and timely statistical and programming help (if somewhat acerbic to those who have not done their homework)

R: an international collaboratory

• Most appropriate for an International Society to be aware of the power of international collaboration to produce cutting edge software

R: statistics for all of usR: an international statistical collaboratory

Prepared for part of the symposium onMultivariate Statistical Methods in Individual Differences ResearchInternational Society for the Study of Individual Differences Biennial meeting, Adelaide, July , 2005

William Revelle, Northwestern Universitypersonality-project.org/r/personality-project.org/r.short.html

top related