
Introductory Guide to S-Plus

Final Version

B.D. Ripley
Professor of Applied Statistics,
University of Oxford

e-mail: [illegible]

24 August 1994


Preface

This guide was originally written for graduate students in Statistics at the University of Oxford. The first versions were based closely on notes by Dr. Bill Venables of the Department of Statistics at the University of Adelaide, but have been updated to reflect later versions of S, the extensions of S-Plus and local facilities. Several sections, in particular 4, 6 and 11, remain close to Dr. Venables' original material. This guide will no longer be updated, following the publication of Venables & Ripley (1994). [See p. 1. Where that takes a significantly better approach than earlier editions of these notes, the material formerly here has been dropped.]

The guide is to S-Plus, but much of it will be relevant to users of the underlying S. Extensions which are only in S-Plus include dynamic graphics (§ 6.3, brush and spin) and the classical statistics functions (§ 9). The terminology of this guide is intended to be precise, only referring to S-Plus rather than S for features unique to S-Plus.

These notes were written for a particular environment, S-Plus 3.2 on Sun SparcStations running the Open Windows windowing system. You will find a number of differences depending on your local environment. It will help to have the library ripley available; it should be in the same source as these notes. It can also be obtained by anonymous ftp from

[illegible] (163.1.20.1)

in file [illegible]. It is available from [illegible] (see Section A.2) as

[illegible]

Alternatively, [illegible] from Venables & Ripley (1994) can be used.

This guide may be freely copied and redistributed for any educational purpose (including commercial courses) provided its authorship (B.D. Ripley and W.N. Venables) is clearly stated. Where appropriate, a small charge to cover the costs of production and distribution, only, may be made.

B.D. Ripley,
University of Oxford,
24th August, 1994.


Contents

1 Introduction
    1.1 Starting and Finishing
    1.2 Getting Help
    1.3 Hardcopy Output
2 Datasets
3 A First Session
4 Simple Data Manipulation
    4.1 Vectors
    4.2 Vector Arithmetic
    4.3 Generating Regular Sequences of Numbers.
    4.4 Logical Vectors. Missing Values
    4.5 Character Vectors
    4.6 Index Vectors. Selecting and Modifying Subsets of a Data Set
    4.7 Arrays
    4.8 Lists
    4.9 Data Frames
5 Reading data into S
    5.1 Writing out data
6 Graphics
    6.1 Graphical Parameters
    6.2 Some Basic Plotting Functions
    6.3 Interaction with Plots
    6.4 Brush and Spin
    6.5 Equally-scaled plots
7 Statistical Summaries
    7.1 Arithmetical Summaries
    7.2 Histograms and Stem-and-Leaf Plots
    7.3 Boxplots
8 Distributions
    8.1 Q-Q Plots
9 Classical Statistics
10 Handling Categorical Data
    10.1 The Function tapply(...) and Ragged Arrays
11 Loops and Conditional Execution
12 Writing Your Own Functions
13 Statistical Models
    13.1 Model Formulas
    13.2 One-way Layouts
    13.3 Designed Experiments
    13.4 Generalized Linear Models
    13.5 Updating and Selecting Models
14 Multivariate Analysis

Appendix

A Libraries
    A.1 Library ripley
    A.2 Sources of Libraries


1 Introduction

S is a statistical language developed at AT&T's Bell Laboratories. S-Plus is a binary distribution of S, with added functions, produced by the StatSci Division of MathSoft in Seattle. The S system was radically re-designed in the 1988 release and known as 'New S'. In August 1991 a new release of what is once again called S consisted of a moderate revision of 'New S' together with far-ranging extensions. S-Plus 3.0 was introduced in late 1991, based on that release of S, with numerous additional features. S-Plus 3.1 was released at the very end of 1992, and S-Plus 3.2 in very early 1994.

The main references are:

R.A. Becker, J.M. Chambers and A.R. Wilks (1988) The NEW S language. Wadsworth & Brooks/Cole.

J.M. Chambers and T.J. Hastie (1992) Statistical Models in S. Wadsworth & Brooks/Cole.

It is not the intention of this guide to replace the books. Rather these notes are intended as a brief introduction to the capabilities of the S programming language and to how to perform some common statistical procedures within S. Users of S-Plus will need to consult both books, probably frequently. Both books contain some reference documentation, but the on-line versions (see § 1.2) are later and definitive.

There are also manuals for S-Plus itself, whose organization differs from release to release.

Other books include

W.N. Venables and B.D. Ripley (1994) Modern Applied Statistics with S-Plus. New York: Springer. ISBN 0-387-94350-1

which goes far beyond the coverage of this guide, including many topics (such as robust statistics, non-linear regressions, modern regression, survival analysis, tree-based models, time series and spatial statistics) not covered here, as well as going into greater depth on what is covered.

1.1 Starting and Finishing

To start S-Plus, type the command

[illegible]

After a short while (and, the first time, an initialization message) you get the S-Plus prompt > (which can be changed, but the default is assumed here):

>

This is waiting for input from you.

Technically S is a function language with a very simple syntax. Like most Unix based packages it is case sensitive, so A and a are different variables. Elementary commands consist of either expressions or assignments. If an expression is given as a command, it is evaluated, printed, and the value is discarded. (In fact it is kept in the (hidden) variable .Last.value and so can be retrieved from the 'bin'.) An assignment also evaluates an expression and passes the value to a variable but the result is not printed automatically. An expression can be as simple as [illegible] or a complex function call. Assignments are indicated by the assignment operator <- or _. (As the first needs two keystrokes, lazy typists use the second. However, the first is easier to read.) For example,

� ��� �� ���������������������� ���"!� �# ��$&%('�'�)�)��� �*� �+�������,�+����� �-�"!/.10 ��� 0�� �2���+� �"�����+!����34��5����6��0�!� � �7% $-)"8 �

The [1] states that the answer is starting at the first element of a vector.

Commands are separated either by a semi-colon, ;, or by a newline. If a command is not complete at the end of a line, S will give a different prompt, namely

+

on second and subsequent lines and continue to read input until the command is syntactically complete.

S can be extended by writing new functions, which then can be used exactly as built-in functions (and can even replace them). How to write your own functions is covered in section 12.

1.2 Getting Help

S has an inbuilt help facility similar to the man facility of Unix. To get more information on any specific named function or dataset, for example [illegible], the command is

> help([illegible])

For a feature specified by special characters, and in a few other cases (one is "[illegible]"), the argument must be enclosed in double quotes, making it a 'character string':

> help("[illegible]")

Help uses a window which overlays your main window. The pager accepts a number of options, including space for the next page and q to quit. (Other useful options are [illegible] to go to the top and control-b to go back a page.) If you prefer, a separate help window (which can be left up) can be obtained by the argument window=T. Another way to get help is by

> ?[illegible]

Short help is given by the function args.

S-Plus also has a window-based help facility, started by

> help.start(gui="openlook")

Click with the left mouse button on items to select categories and items. The help window can be left up, or removed by

> help.off()

It is not advisable to quit S-Plus windows from the frame menu.

1.3 Hardcopy Output

Graphics are printed by holding down the right button on the [illegible] menu in a [illegible] window (see § 6) and releasing over the print item. This will print on the nearest laser printer (or that selected by your [illegible] environment variable).

To record a session cut-and-paste to a textedit window, then remove your mistakes (if any) and save as a Unix file.

2 Datasets

Datasets are stored in a directory ~/.Data. They are permanent, so all the objects you create are retained until explicitly deleted. (As the directory name .Data begins with . it will normally be hidden in file listings from Unix by ls.) If there is a .Data directory in the current directory when S is invoked, that directory is used rather than ~/.Data. This provides one way to organize your S work, using separate directories for each project.

In S, to get a list of names of the objects currently defined use the command

> objects()

Your own functions are also stored in .Data. To find out whether an object is a function or dataset, and what is in it, just type its name at the prompt, e.g.

> [illegible]

This prints out the function, dataset, .... In the later versions of S it may print a short summary of the object. To get the full details, use

> print.default(object)

When S looks for an object, it searches in turn through a sequence of directories known as the search list. Usually the first entry in the search list is the .Data sub-directory of the current working directory. The names of the directories currently on the search list can be found by the function

> search()

The names of the objects held in any directory on the search list can be displayed by giving the ls function an argument. For example objects(2) lists the contents of the second directory in the search list. Normally the second, third and fourth directories are built-in functions, and the fifth, sixth and seventh contain standard datasets.

Extra search directories can be added to this list with the attach(...) function and removed with the detach(...) function, details of which can be found in the manuals or the help facility. Note that attached directories are searched after the .Data directory in the order last attached to first attached.

To remove objects permanently the function rm is available:

> rm([illegible])

The function remove(...) can be used to remove objects with non-standard names.

Warning

Objects in your .Data directory will take precedence over system objects of the same name. This is a frequent cause of rather obscure errors, and can cause apparently correct behaviour but erroneous results. Avoid using names such as [illegible] for your own objects. If you get peculiar errors, clean up your .Data directory and try again!

S keeps a record of commands in the .Audit file in the .Data directory. This is a hidden file and can grow rather large. Use (from the Unix command line)

% Splus TRUNC_AUDIT 0

occasionally to clean out the audit file entirely (or omit the 0 to keep the last 0.5Mb).


3 A First Session

The sample session given below is intended to show by example some of the capabilities of the system. Work through the session given by the commands on the left of the page. Some clues as to what is going on are given at the right hand side of the page.

[illegible]                                   Start the session.
> openlook()                                  Open the graphics window.
> library(ripley)                             Add a library of functions and datasets.
> help(trees)                                 Use q to quit.
> trees                                       Print out a data frame of the trees data
> attach(trees)                               so that we can use names diam etc.
> hist(diam)                                  Histogram as counts.
> hist(diam, nclass=[illegible], probability=T)
                                              As probability density.
> help(hist)
> stem(diam)                                  Stem-and-leaf plot.
> plot(diam, volume)                          Scatter plot.
> trees.lm <- lm(volume ~ diam)               Linear regression.
> summary(trees.lm)                           Summary of fit.
> anova(trees.lm)                             Analysis of variance table.
> abline(trees.lm)                            Plot line on scatter plot.
> identify(diam, volume, height)              Move mouse to plot and click with left button
                                              to see what height is. Click middle button to quit.
> par(mfrow=c(1,2))                           Set up 1 row, 2 cols for plots.
> plot(trees.lm)                              Plots of fitted values and |residuals| vs fitted value.
> par(mfrow=c(1,1))                           One plot again.
> qqnorm(residuals(trees.lm))                 Normal probability plot of residuals
> qqnorm(studres(trees.lm))                   and of Studentized residuals.
> qqline(studres(trees.lm))                   Line through quartiles.
> pairs(trees)                                All pair-wise scatter plots.
> brush(cbind(diam, height, volume))          Rotate points in 3D, select and de-select points.
                                              Click on quit to end.
> trees.lm2 <- lm(volume ~ diam + height)     Multiple regression. Try functions as before.
> trees.lm3 <- lm(log(volume) ~ log(diam) + log(height))
> detach("trees")                             To avoid any confusion.
> help(roads)
> attach(roads)
> plot(drivers, deaths)
> plot(drivers, deaths, log="xy")
> state <- row.names(roads)
> identify(drivers, deaths, state)            Find the 'odd' states.
> plot(fuel, deaths, log="xy")
> identify(fuel, deaths, state)
> roads.mat <- cbind(drivers, fuel, deaths)   Set up a matrix.
> pairs(roads.mat)                            Look at pattern of all three.
> brush(roads.mat, [illegible])               Use mouse to highlight points and check their
                                              identity. Then click on quit.
> q()                                         Finish session.


4 Simple Data Manipulation

The basic data objects in S are vectors, arrays, lists and data frames.

4.1 Vectors

S operates on named data structures. The simplest such structure is the vector, which is a single entity consisting of an ordered collection of numbers. To set up a vector named x, say, consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7, use the S command

> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

This is an assignment statement using the function c(...) taking an arbitrary number of vector arguments and whose value is the vector obtained by concatenating its arguments.

A number occurring by itself in an expression is taken as a vector of length one.

Assignments can also be made in the other direction, using the obvious change in the assignment operator. So the same assignment could be made using

> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x

If an expression is used as a complete command, the value is printed and lost. So now if we were to use the command

> 1/x

the reciprocals of the five values would be printed (and, of course, the value of x would be unchanged).

4.2 Vector Arithmetic

Vectors can be used in arithmetic expressions, in which case the operations are performed element-by-element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular a constant is simply repeated. So with the above assignments, and with y a vector of length 11, the command

> v <- 2*x + y + 1

generates a new vector v of length 11 constructed by adding together, element-by-element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times.

The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power. In addition all of the common arithmetic functions are available. log, log10, exp, sin, cos, tan, sqrt, and so on, all have their usual meaning. max and min select the largest and smallest elements of a vector respectively. range is a function whose value is a vector of length two, namely c(min(x), max(x)). The element-by-element maximum and minimum of two or more vectors are given by pmax and pmin. length(x) is the number of elements in x, sum(x) gives the total of the elements in x and prod(x) their product.
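The recycling rule can be tried on a small case. The following is only a sketch, written in the subset of S syntax that also runs in R; the vector y is an assumption (any vector of length 11 would do), built here from x so that the example is self-contained.

```r
# x is the five-element vector of section 4.1
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# y is an assumed helper of length 11 (not defined in the text above)
y <- c(x, 0, x)

# 2*x (length 5) is recycled 2.2 times, the constant 1 is recycled 11 times
v <- 2*x + y + 1

length(v)   # the length of the longest operand
v[6]        # 2*x[1] + y[6] + 1, because x has wrapped round to its first element
```

Because 11 is not a multiple of 5 the recycling is fractional; R, for instance, accepts this but warns that the longer length is not a multiple of the shorter.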

Two statistical functions are mean(x), which evaluates to sum(x)/length(x), and var(x), which gives the value sum((x-mean(x))^2)/(length(x)-1), the sample variance. If the argument to var(...) is an n × p matrix the value is a p × p sample covariance matrix obtained from regarding the rows as independent p-variate sample vectors.

sort(x) returns a vector of the same size as x with the elements arranged in increasing order. Other, more flexible, sorting facilities are available (see order(...), which produces a permutation to do the sorting, and sort.list).
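As a check on the two formulas, the summaries can be computed by hand and compared with the built-in functions. This is a sketch only; the names m and s2 are introduced here for illustration and the data are the vector x of section 4.1.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# mean and sample variance written out from their definitions
m  <- sum(x)/length(x)
s2 <- sum((x - m)^2)/(length(x) - 1)

c(m, mean(x))     # the two values agree
c(s2, var(x))     # likewise

range(x)          # c(min(x), max(x))
sort(x)           # increasing order
order(x)          # the permutation that sorts x, so x[order(x)] equals sort(x)
```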

4.3 Generating Regular Sequences of Numbers.

S has a number of facilities for generating commonly used sequences of numbers. For example 1:30 is the vector c(1, 2, ..., 29, 30). The colon operator has highest priority within an expression, so, for example 2*1:15 is the vector c(2, 4, 6, ..., 28, 30). Put n <- 10 and compare the sequences 1:n-1 and 1:(n-1).

The construction 30:1 may be used to generate a backwards sequence.

The function seq(...) is a more general facility for generating sequences. It has five arguments, only some of which may be specified in any one call. The first two arguments, if given, specify the beginning and end of the sequence, and if these are the only two arguments given the result is the same as the colon operator. That is, seq(2,10) is the same vector as 2:10.

Parameters to seq(...), and to many other S functions, can also be given in named form, in which case the order in which they appear is irrelevant. The first two parameters may be named from=value and to=value; thus seq(1,30), seq(from=1, to=30) and seq(to=30, from=1) are all the same as 1:30. The next two parameters to seq(...) may be named by=value and length=value, which specify a step size and a length for the sequence respectively. If neither of these is given, the default by=1 is assumed.

For example

> seq(-5, 5, by=.2) -> s3

generates in s3 the vector c(-5.0, -4.8, -4.6, ..., 4.6, 4.8, 5.0). Similarly

> s4 <- seq(length=51, from=-5, by=.2)

generates the same vector in s4.

The fifth parameter may be named along=vector, which if used must be the only parameter, and creates a sequence 1, 2, ..., length(vector), or the empty sequence if the vector is empty (as it can be).

A related function is rep(...) which can be used for replicating a structure in various complicated ways. The simplest form is

> s5 <- rep(x, times=5)

which will put five copies of x end-to-end in s5.
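The sequence constructions of this section can be compared directly. The sketch below assumes only the functions named above; the names a and b are introduced for illustration, and x is the vector of section 4.1.

```r
n <- 10
a <- 1:n-1        # the colon binds first: (1:n) - 1, that is 0, 1, ..., 9
b <- 1:(n-1)      # 1, 2, ..., 9

s3 <- seq(-5, 5, by=.2)
s4 <- seq(length=51, from=-5, by=.2)   # same sequence, given by length instead of end point

x  <- c(10.4, 5.6, 3.1, 6.4, 21.7)
s5 <- rep(x, times=5)                  # five copies of x end-to-end

length(a); length(b); length(s3); length(s5)
```

Comparing a and b shows why the priority of the colon operator matters: the two differ in both length and starting value.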


4.4 Logical Vectors. Missing Values

As well as numerical vectors, S allows manipulation of logical quantities. The elements of a logical vector have just two possible values, represented formally as F (for 'false') and T (for 'true'). (TRUE and FALSE are also valid representations.)

Logical vectors are generated by conditions. For example

> temp <- x > 13

sets temp as a vector of the same length as x with values F corresponding to elements of x where the condition is not met and T where it is.

The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In addition if c1 and c2 are logical expressions, then c1 & c2 is their intersection (and), c1 | c2 is their union (or) and !c1 is the negation of c1.

Logical vectors may be used in ordinary arithmetic, in which case they are coerced into numeric vectors, F becoming 0 and T becoming 1. However there are situations where logical vectors and their coerced numeric counterparts are not equivalent.

In some cases the components of a vector may not be completely known. When an element or value is "not available" or a "missing value" in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general any operation on an NA becomes an NA. The motivation for this rule is simply that if the specification of an operation is incomplete, the result cannot be known and hence is not available.

The function is.na(x) gives a logical vector of the same size as x with value T if and only if the corresponding element in x is NA.

> ind <- is.na(z)
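A short sketch of logical vectors and missing values follows. The names big and z and the data in z are invented for the illustration; x is the vector of section 4.1.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

big <- x > 13              # a condition gives a logical vector: F F F F T
sum(big)                   # coerced to 0/1 in arithmetic, so this counts the T values

z <- c(1, NA, 3, NA, 5)    # made-up data with two missing values
ind <- is.na(z)            # T exactly where z is NA
z + 1                      # any operation on an NA gives an NA
sum(ind)                   # number of missing values
```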

4.5 Character Vectors

Character quantities and character strings are used frequently in S, for example as plot labels. They are denoted by a sequence of characters delimited by the double quote character. E.g. "x-values", "New iteration results". Single quotes can also be used, in matching pairs.

Character strings may be collected into a vector by the c(...) function; examples of their use will emerge frequently.

The paste(...) function takes an arbitrary number of character string arguments and concatenates them into a single character string. Any numbers given among the arguments are coerced into character strings in the same way they would be if they were printed. The arguments are by default separated in the result by a single blank character, but this can be changed by the named parameter, sep=string, which changes it to string, possibly empty.

For example

> labs <- paste(c("X","Y"), 1:10, sep="")

makes labs the character vector c("X1", "Y2", "X3", ..., "X9", "Y10"). Note in particular that recycling of short vectors takes place here too; thus c("X","Y") is repeated 5 times to match the sequence.
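The behaviour of paste can be seen in a few lines. This is a sketch; the strings in the second and third calls are invented for the illustration.

```r
# c("X","Y") is recycled five times against 1:10; sep="" removes the default blank
labs <- paste(c("X","Y"), 1:10, sep="")
labs

# numbers are coerced to character strings; the default separator is a single blank
paste("Run", 3, "of", 10)

# any other separator can be supplied through sep
paste("a", "b", sep="-")
```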

The elements of a vector can be named (as well as numbered) by assigning a character vector to its names attribute, e.g.

��� ��������� ��������������� �"!#�%$�$#� �'&"( �������� ���� ����� ( ��� �)�+*+, �-���-��� *.� * �-/�/'0�� *.�1*2 "3 �2��4'� *.�%*+5�6 4 *.�1*87�6�9�6*"(��� �����, �-���-�'���-/�/'0�� :3 �2��4�� 5�6 4 7�6�9�6��� �'� �"! $�$ �;&

4.6 Index Vectors. Selecting and Modifying Subsets of a Data Set

Elements of a vector may be extracted by specifying the element in square brackets, e.g. x[6]. More generally, subsets of a vector (or any expression that evaluates to a vector) may be selected by appending to the name of the vector an index vector in square brackets. Such index vectors can be any of four distinct types:

1. A logical vector. In this case the index vector must be of the same length as the vector from which elements are to be selected. Values corresponding to T in the index vector are selected and those corresponding to F omitted. For example

> y <- x[!is.na(x)]

creates (or re-creates) an object y which will contain the non-missing values of x, in the same order. Note that if x has missing values, y will be shorter than x. Also

> (x+1)[(!is.na(x)) & x>0] -> z

creates an object z and places in it the values of the vector x+1 for which the corresponding value in x was both non-missing and positive.

2. A vector of positive integral quantities. In this case the values in the index vector must lie in the set {1, 2, ..., length(x)}. The corresponding elements of the vector are selected and concatenated, in that order, in the result. The index vector can be of any length and the result is of the same length as the index vector. For example x[6] is the sixth component of x and

> x[1:10]

selects the first 10 elements of x (assuming length(x) >= 10). Also

> c("x","y")[rep(c(1,2,2,1), times=4)]

(an admittedly unlikely thing to do) produces a character vector of length 16 consisting of "x", "y", "y", "x" repeated four times.

3. A vector of negative integral quantities. In this case the index vector specifies the values to be excluded rather than included. Thus

> y <- x[-(1:5)]

gives y all but the first five elements of x.

4. A vector of character strings. This possibility only applies where an object has a names attribute to identify its components. In this case a subvector of the names vector may be used in the same way as the positive integral labels in 2.

> [illegible]

This option is particularly useful in connection with data frames (see § 4.9).
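The four kinds of index vector can be tried side by side. The sketch below uses invented data: x is a made-up vector with two missing values, and fruit, its names and lunch are hypothetical names chosen only to show indexing by character strings.

```r
x <- c(4, NA, -2, 7, NA, 1, 9, 3, 8, 6, 5, 2)   # made-up data

# 1. logical index vectors
y <- x[!is.na(x)]                    # the ten non-missing values
z <- (x+1)[(!is.na(x)) & x>0]        # x+1 where x is non-missing and positive

# 2. positive integers
x[6]                                 # sixth component
x[1:10]                              # first ten elements

# 3. negative integers exclude elements
w <- x[-(1:5)]                       # all but the first five

# 4. character strings select by name (hypothetical example)
fruit <- c(5, 10, 1, 20)
names(fruit) <- c("orange", "banana", "apple", "peach")
lunch <- fruit[c("apple", "orange")]
lunch
```

Note that in the second logical example the missing values drop out because F & NA is F, so no NA reaches the index.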

An indexed expression can also appear on the receiving end of an assignment, in which case the assignment operation is performed only on those elements of the vector. The expression must be of the form vector[index vector], as having an arbitrary expression in place of the vector name would not make sense.

The vector assigned must match the length of the index vector, and in the case of a logical index vector it must again be the same length as the vector it is indexing.

For example

> x[is.na(x)] <- 0

replaces any missing values in x by zeros and

> y[y<0] <- -y[y<0]

has the same effect as

> y <- abs(y)
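Both replacements can be checked on small made-up vectors. In this sketch the data and the name y2 (a working copy of y) are assumptions made for the illustration.

```r
x <- c(3, NA, -1, NA, 2)      # made-up data
x[is.na(x)] <- 0              # only the missing elements are assigned to
x

y  <- c(-2, 5, -7, 0)         # made-up data
y2 <- y                       # working copy, so that y stays available for comparison
y2[y2<0] <- -y2[y2<0]         # flip the sign of the negative elements only
y2
abs(y)                        # gives the same values
```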

4.7 Arrays

An array can be considered as a multiply subscripted collection of data entries of the same type, for example numeric, logical or character string.

An array is defined by having a dimension vector, a vector of positive integers. If its length is k then the array is k-dimensional. The values in the dimension vector give the upper limits for each of the k subscripts. The lower limits are always 1. Suppose, for example, z is a vector of 1500 elements. The assignment

> dim(z) <- c(3,5,100)

allows z to be treated as a 3 × 5 × 100 array.

Other functions such as matrix(...) and array(...) are available for simpler and more natural looking assignments in special cases, e.g.

> z <- array(z, c(3,5,100))
> z <- matrix(z, 3, 5)

The values in the data vector give the values in the array in the same order as they would occur in Fortran, that is, with the first subscript moving fastest and the last subscript slowest. For example if the dimension vector for an array, say a, is c(3,4,2) then there are 3 × 4 × 2 = 24 entries in a and the data vector holds them in the order a[1,1,1], a[2,1,1], ..., a[2,4,2], a[3,4,2]. To make life easier, matrix has a byrow=T parameter for data presented by row rather than by column.
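The storage order is easiest to see when the data vector is simply 1, 2, ..., so that each entry reports its own position. This is a sketch with invented data; the names m and mr are introduced for the illustration.

```r
z <- 1:1500
dim(z) <- c(3,5,100)      # z is now treated as a 3 x 5 x 100 array

z[2,1,1]                  # first subscript moves fastest: position 2
z[1,2,1]                  # one step in the second subscript skips 3 entries: position 4
z[1,1,2]                  # one step in the third subscript skips 3*5 entries: position 16

m  <- matrix(1:6, 2, 3)            # filled by column
mr <- matrix(1:6, 2, 3, byrow=T)   # filled by row
m[1,2]
mr[1,2]
```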


Individual elements of an array may be referenced by giving the name of the array followed by the subscripts in square brackets, separated by commas. More generally, subsections of an array may be specified by giving a sequence of index vectors in place of subscripts; however if any index position is given an empty index vector, then the full range of that subscript is taken. Thus a[2,,] is a 4 × 2 array with dimension vector c(4,2) and data vector

a[2,1,1], a[2,2,1], a[2,3,1], a[2,4,1], a[2,1,2], a[2,2,2], a[2,3,2], a[2,4,2],

in that order. a[,,] stands for the entire array, which is the same as omitting the subscripts entirely and using a alone.
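A short sketch of a subsection with empty index positions, again using an illustrative array filled with 1:24:

```r
a <- array(1:24, dim = c(3, 4, 2))

# Empty second and third index positions: take their full range
b <- a[2, , ]
dim(b)         # a 4 x 2 array
as.vector(b)   # its data vector, in storage order
```

Because the first subscript is held at 2, the entries taken are every third element of the original data vector, starting from the second.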

Arrays may be used in arithmetic expressions and the result is an array formed by element-by-element operations on the data vector. The dimension vectors of operands generally need to be the same, and this becomes the dimension vector of the result. So if A, B and C are all similar arrays, then

> D <- 2*A*B + C + 1

makes D a similar array with data vector the result of the evident element-by-element operations. The matrix multiplication operator is %*%.

There are extensive matrix manipulation facilities, including transposes and eigenvalue, Cholesky, QR and singular-value decompositions. See help on t, eigen, chol, qr and svd.

Any dimension of an array can be given a set of names using dimnames, but it is usually easier to use the facilities of data frames.

Matrices can be built up from given vectors and matrices by the functions cbind(...) and rbind(...). Informally, cbind(...) forms matrices by binding together vectors or matrices horizontally, or column-wise, and rbind(...) vertically, or row-wise.
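The binding functions and the matrix product can be tried on two short vectors. The vectors u and v below are made up for illustration:

```r
u <- 1:3
v <- 4:6

X <- cbind(u, v)   # 3 x 2: u and v become the columns
Y <- rbind(u, v)   # 2 x 3: u and v become the rows

P <- Y %*% X       # 2 x 2 matrix product, not element-by-element
P
t(X)               # the transpose of X has the same entries as Y
```

Here P holds the inner products of u and v with themselves and with each other.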

4.8 Lists

An S list is an object consisting of an ordered collection of objects known as its components. There is no particular need for the components to be of the same mode or type, and, for example, a list could consist of a numeric vector, a logical value, a matrix, a character array, a function, and so on.

Components are always numbered and may always be referred to as such. If trees is a list, then the function length(trees) gives the number of (top level) components it has, specified as trees[[1]], trees[[2]] and so on.

Components of lists may also be named, and in this case the component may be referred to either by giving the component name as a character string in place of the number in double square brackets, or, more conveniently, by giving an expression of the form

> name$component name

for the same thing. This is a very useful convention as it makes it easier to get the right component if you forget the number, and is strongly advised. You can find out the names of the components by

> names(name)


and this generates much less output than printing the object, which will achieve the same purpose.

The names of components may be abbreviated down to the minimum number of letters needed to identify them uniquely. Most of the datasets are in fact lists (or can be treated as lists), so we could refer to the component height of the trees data as trees$h. Similarly, many S functions return lists of results.

It is important to distinguish trees[[1]] from trees[1]. “[[...]]” is the operator used to select a single element of a list, whereas “[...]” is a general subscripting operator for vectors. Fortunately, numbered components are needed very rarely.
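The different ways of reaching a component can be compared on a small list. The list lst and its component names are invented for this sketch:

```r
# A hypothetical list with three named components of different modes
lst <- list(name = "Fred", ages = c(4, 7, 9), flag = T)

length(lst)   # number of top-level components
names(lst)    # the component names

lst[[2]]      # the second component itself: a numeric vector
lst$ages      # the same component, by name
lst$a         # abbreviated name; unique, so it also finds ages

lst[2]        # single brackets give a sublist of length 1
```

The last line is the case to watch: lst[2] is still a list, so arithmetic that works on lst[[2]] will not work on it.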

New lists may be formed from existing objects by the function list(...). An assignment of the form

> lst <- list(name.1=object.1, name.2=object.2, name.3=object.3)

sets up a list lst of 3 components using the existing objects object.1, object.2 and object.3 for the components and giving them names as specified by the argument names (which can be chosen freely). If these names are omitted, the components are numbered only.

Lists can be attach-ed as well as directories, and this allows their components to be accessed as if they were stand-alone entities. Thus in the trees example we could attach(trees) and then refer to a component such as height by name alone. It is wise to detach the list after use to avoid any nasty surprises.

4.9 Data Frames

Data frames were introduced in the August 1991 release of S, and can be thought of as closely-coupled lists of data vectors of the same length. Unlike matrices, the data vectors can be of different types, including character data. Both the rows and columns can be labelled. Consider the data frame road from library(ripley):

> road

��� �>� 5/�L��-��:-��-� K-<CK���>EM�=��� !N�����K O�=���!P ! �>H����/� Q�R�S �CT�S RCUV1JW R�R+1 W R�XY���GQ21 WP !�����Z-� U�[ � � W219U T+1 Q [�W R+1 X1 1�1 1�1�1 1�1 1�1�1�1 1�1�1 1\ < �GXGS�Q XG[�U R�[]1JW^�CW W21JW U-WY��S�W21 W\ <CE�� X�T Q [ S UV1JR _�X+1 W X�Q [/�`1 W

which has both row and column labels. The columns can be treated as components of a list:

acb�d�e�f�g�b�hGb e�ijJk�l mCmonBp q`nsr tCtonBp uCtonBpvk>k�w`nsp u>t`nsp q`n,k ton@x ponBp q>u`nBp wCtonBp x�ponBpj,k�t�lMk�pCyonBp wCr`nspMk�pCponBpvk�y�x�nBp mCq`nsp xGp`nsp k�r`nBp yCronBp k�u`nsp r>q`nBpzkCk�ponBp qCronBpjBy>q�lMk�pCponBp uCy`nsp

and the structure can be treated as a two-dimensional array:


��������� ���������� �������

��� ���������� �"!#���$�%�'&�(�)�*+�,�!-���.

��������� �"!#���$�/���(0�1&�2-�3�0��4,5�(6���7*��1*���(189�;:<��� � &�(,)<*>=0:�( �

!#� ?6�6@ � �6A� B6A ?C.0. ��. ?,@�.

Note how the row label is carried along.

Data frames can be attach-ed just as lists can, and this allows their columns to be accessed as if they were named vectors.

A data frame can be created from vectors and matrices by the data.frame function. For example:

> treeframe <- data.frame(diam=trees$d, height=trees$h, volume=trees$v)

If the columns are not named, they pick up the names of the vectors, so

> treeframe <- data.frame(trees$d, trees$h, trees$v)

gives

  trees.d trees.h trees.v
1     8.3      70    10.3
2     8.6      65    10.3
3     8.8      63    10.2
4    10.5      72    16.4
5    10.7      81    18.8
....

Character vectors given to data.frame are automatically treated as factors (see §10), unless specified within an I() function.
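A minimal sketch of building a data frame from separate vectors follows. The vectors ht and vol are illustrative stand-ins holding the first three rows shown above:

```r
# Two numeric vectors of equal length (assumed names)
ht  <- c(70, 65, 63)
vol <- c(10.3, 10.3, 10.2)

df <- data.frame(ht, vol)   # unnamed arguments: columns pick up the vector names
names(df)
dim(df)

df$vol      # a column, treated as a list component
df[2, ]     # a row, treated as a two-dimensional array
```

Only numeric columns are used here, since the treatment of character columns is the point on which implementations are most likely to differ.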


5 Reading data into S

Data objects will usually be read as values from external files. This is done most conveniently with the scan function. To read a vector from the keyboard, call scan() without a file name and type the values at the prompt. Input is terminated by a blank input line (from the terminal only, despite the documentation) or by EOF (ctrl-D in Unix). To read in a character vector we specify the vector type by the second argument, for example by giving an empty character string "" there.

To read from a file specify its name as the first argument.

Now suppose that multiple data vectors of equal length are to be read in parallel. For example suppose that there are three vectors, the first of mode character and the remaining two of mode numeric, and the file is input.dat. Use scan to read in the three vectors as a list, as follows

> inp <- scan("input.dat", list(id="", x=0, y=0))

The second argument is a dummy list structure that establishes the mode of the three vectors to be read. The result, held in inp, is a list whose (named) components are the three vectors read in.

Matrices are usually read by row, by wrapping the call to scan in a call to matrix with byrow=T. The argument skip to scan can be used to skip header rows of files.
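Both uses of scan can be tried without any existing data by first writing small temporary files. Everything below (file contents, the names inp and m, the component names id, x and y) is made up for the sketch:

```r
# Write a small three-column file: one character field, two numeric fields
f <- tempfile()
cat("a 1 10\n", "b 2 20\n", "c 3 30\n", file = f, sep = "")

# The dummy list gives the mode (and name) of each field
inp <- scan(f, list(id = "", x = 0, y = 0))
inp$id
inp$y

# A purely numeric file, read by row into a matrix
g <- tempfile()
cat("1 2 3\n", "4 5 6\n", file = g, sep = "")
m <- matrix(scan(g), ncol = 3, byrow = T)
m

unlink(c(f, g))   # remove the temporary files
```

Without byrow=T the six numbers would be laid out column by column, which is rarely what a file written row by row intends.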

Data frames can be read from a file by the read.table function. The data file should be a table in one of a number of formats:

1. A file (see the example on page 39) which has a first row naming the columns, followed by the table of numeric data, can be read by read.table with the argument header=T.


2. A file laid out like the listing of a data frame. This has a first header line, and rows which contain the row label followed by the data for the columns, such as

        deaths drivers popden rural temp fuel
Alabama    968     158     64    66   62  119
Alaska      43      11    0.4   5.9   30  6.2
.............

Note that the header has one less entry than subsequent rows. This format is read by

> road <- read.table("road.dat")

3. A table without any header. The row labels are then 1, 2, ... and the column labels V1, V2, .... However, if there exists a character column without duplicates, the first such is taken as the row labels and removed as a column.

Sometimes it is necessary to read in character strings which contain spaces. This can be done by separating the fields in the file by, for example, tabs or commas:

> usroad <- scan("road.dat", sep="\t", list(state="", deaths=0, drivers=0, popden=0, rural=0, jantemp=0, fuel=0))

where \t is the usual Unix abbreviation for a tab character. This device also applies to read.table.
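The second layout (a header one entry shorter than the data rows) can be sketched with a temporary file. The file holds only two of the columns and two of the rows from the listing, and the name road2 is an assumption made here to avoid clashing with any existing object:

```r
# A header naming two columns, then rows of label + two values
f <- tempfile()
cat("deaths drivers\n",
    "Alabama 968 158\n",
    "Alaska 43 11\n", file = f, sep = "")

road2 <- read.table(f, header = T)
road2
row.names(road2)   # the first field of each row became the row label
road2$deaths

unlink(f)
```

Because the header is one entry short, the first field of each data row is taken as the row label rather than as a data column.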

5.1 Writing out data

There are many ways to write out data from S, for example the print, cat and format commands. To write directly to a file, there are cat, write and, from S-Plus 3.2, write.table which is usually the simplest method. This can write a data frame, matrix or vector, with syntax

> write.table(data, file="", sep=",")

and further arguments can be found in the help page. By default it writes out comma-separated items on rows, but the separator can be changed to space or tab ("\t" in Unix).

The function write writes a vector, with syntax

> write(data, file="data", ncolumns=5)

for numeric data, and in one column for character data. To write out a matrix m, use

> write(t(m), file="data", ncolumns=ncol(m))

The function format converts data to a line of characters, and can be used with write or cat to construct custom reports.
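The transpose in the matrix form of write matters because write emits the data vector in storage (column) order. A round-trip sketch, with a made-up matrix and a temporary file:

```r
m <- matrix(1:6, 2, 3)   # hypothetical 2 x 3 matrix
f <- tempfile()

# t(m) puts the rows of m consecutively in the data vector,
# and ncolumns=ncol(m) breaks the output after each row
write(t(m), file = f, ncolumns = ncol(m))

# Read it back by row and compare with the original
m2 <- matrix(scan(f), ncol = ncol(m), byrow = T)
m2

unlink(f)
```

Writing m itself with the same ncolumns would produce a file whose rows are not the rows of m.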


6 Graphics

The graphical facilities are central to S. The steps involved are as follows:

1. The type of terminal, or device, is declared to S at the beginning of the session:�������������� ����

2. A command is issued to construct a plot from data. For example

> plot(x, y)

specifies a simple point plot where x and y are vectors giving the x- and y-coordinates of the points respectively. (The command includes a default automatic choice of axes, scales, titles and plotting characters, all of which can be overridden with additional graphical parameters that could be included as named arguments in the command.)

6.1 Graphical Parameters

Functions producing graphical output usually have optional additional named arguments that can be specified to override some default parameter settings and hence modify the characteristics of a plot. A short list of the main ones is as follows:

axes=L
    If FALSE all axes are suppressed. Default TRUE, axes are automatically constructed.

type="c"
    Type of plot desired. Values for c are:
    p for points only (the default for function plot),
    l for lines only,
    b for both points and lines (the lines miss the points),
    s, S for step functions (s specifies to change now, S to change just before the next point),
    o for overlaid points and lines,
    h for high density vertical line plotting, and
    n for no plotting (but axes are still found and set).

xlab="string", ylab="string"
    Give labels for the x- and/or y-axes (default: the names, including suffices, of the x and y coordinate vectors).

sub="string", main="string"
    sub specifies a title to appear under the x-axis label and main a title for the top of the plot in larger letters (default: both empty).

xlim=c(lo,hi), ylim=c(lo,hi)
    Approximate minimum and maximum values for x- and/or y-axes settings. These values are automatically rounded to make them “pretty” for axis labelling.

Other graphical parameters control the background characteristics of all subsequent plots and are usually specified by a call to the function par(...). There are a great number of these parameters and the command

> help(par)

gives a complete list of them and their meanings. Some of the more commonly adjusted ones are as follows:

lty=n
    Line type is n. If lines are being plotted, a variety of line types is available; lty=1 means a solid line, lty=2,3,... indicates a variety of broken line forms.

pch="c"
    Specify the character to be used for plotting points (the default differs between graphics terminals and PostScript).

mfrow=c(m,n), mfcol=c(m,n)
    Multiple frames on the one plot. Instead of plotting just one graph per screen, each screen (or page) will contain an array of mn graphs forming an m × n grid. If mfrow is used the screen is filled row-by-row and if mfcol is used it is filled column-by-column. Useful if many graphs are to be inspected simultaneously and high resolution is not necessary.

pty="c"
    Specify the type of plotting region currently in effect. Possible values for c are
    s to generate a square plotting region;
    m (the default) to generate a maximal size plotting region.

6.2 Some Basic Plotting Functions

The elementary plotting functions are as follows:

plot(x, y, ...)
    Scatter plot of points with x- and y-coordinates given by the two main parameters. The pair x, y may be replaced by a single list with components labeled x and y, called a ‘plot list’. Graphical parameters are particularly useful.

points(x, y, ...)
    Add points to an existing plot (possibly using a different plotting character). Follows on from a plot(...) command.

lines(x, y, ...)
    Add lines to an existing plot. Similar to points.
    Note
    > plot(x, y); lines(spline(x, y))
    will join the points of a plot by a cubic spline interpolation function. (See help(spline) for further information.)

text(x, y, labels, ...)
    Add text to a plot at points given by x, y. Normally labels is an integer or character vector in which case labels[i] is plotted at point (x[i], y[i]). The default is 1:length(x).
    Note: This function is often used in the sequence
    > plot(x, y, type="n"); text(x, y)
    The graphics parameter type="n" suppresses the plotting of points but sets up the axes, and the text(...) function supplies special characters (in this case just the integers by default) for the points.

abline(a, b, ...)
abline(h=y, ...)
abline(v=x, ...)
abline(lmobject, ...)
    Draw a line in intercept and slope form, (a, b), across an existing plot. h=y may be used to specify y-coordinates for the heights of horizontal lines to go across a plot, and v=x similarly for the x-coordinates for vertical lines.
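The functions above are normally combined: one high-level call sets up the axes and later calls add to the picture. The sketch below uses made-up data and assumes a graphics device is already open:

```r
# Hypothetical data lying roughly along the line y = 2x
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2)

# Set up axes and titles without drawing the points
plot(x, y, type = "n", xlab = "x", ylab = "y", main = "Hypothetical data")

text(x, y)             # label each point with its index number
lines(x, y, lty = 2)   # join the points with a broken line
abline(0, 2)           # reference line: intercept 0, slope 2
abline(h = 10)         # horizontal line at height 10
abline(v = 5)          # vertical line at x = 5
```

Each added layer uses the coordinate system established by the initial plot call, which is why that call comes first even when it draws nothing.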

6.3 Interaction with Plots

S-Plus allows users to interact with plots, by identifying points and by adding information at places selected by mouse clicks.

identify(x, y, labels)
    On a current plot of x, y, clicking the LEFT mouse button places the appropriate string from labels near the point which has been clicked on. Click the MIDDLE mouse button to finish. If labels is omitted it uses index numbers, and it always returns the indices of selected points.

locator()
    Returns a list of vector coordinates of points clicked by the LEFT mouse button. Click the MIDDLE mouse button to finish.

locator(type="p")
    Ditto, but plots the points as in plot.

legend(locator(1), ...)
    Add a legend box at a mouse-selected point (one LEFT click). See help page for the box contents and other options.

locator() is often used with text to add annotation to plots, for example by giving locator(1) as the position argument of text so that a string is placed at a mouse-selected point.

6.4 Brush and Spin

These are S-Plus enhancements to allow dynamic manipulation of graphs. Spin allows three columns chosen from a matrix of data vectors to be rotated in space.

>BA ���)0� C."���=�����&."�
> �60�6�� -�6���)���,:D�;E2E��

Use the left mouse button to select three of the variables, then use the cross-shaped pad to rotate the point cloud. Finally click on F"G ��� .

includes�60���

and a0��&�)*��

plot. Additionally one can ‘brush’ by selecting points with the leftmouse button, and de-selecting them with the middle button. One can mark points in differentways, with the four symbols, and even label points if

���������is selected.

> �;* G � A <*2�M���&�N /��*�����O2���9P6Q��<�)*;���HO)���<R�QJ�9��*��2�8O2�2��S�Q;���

Now select the first 50 points with one symbol and the last fifty with another. The intermediatenature of the middle 50 then stands out.

6.5 Equally-scaled plots

It is sometimes necessary to make geometrically-square plots, for example so that distances can be assessed accurately. This is somewhat tricky, but is done by the function eqscplot in library(ripley), which adjusts the axis scales to be equal within the current window shape.


Figure 1: Screen dump of an �������������� �� window displaying ��������� on the ������ data, withdifferent highlights for the three groups.


7 Statistical Summaries

7.1 Arithmetical Summaries

Standard summaries such as mean, median and var are available. The var function will take a data matrix and give the variance-covariance matrix, and cor computes the correlation matrix, either from two vectors or a data matrix.

There are also standard functions ����� , ����� , ��������� and � ��������� ��� . The functions ������� and ����will compute trimmed summaries. More sophisticated robust summaries are available, such as����������� ����� � and !"�������#�$����� as well as via the ����%���!�� library.

7.2 Histograms and Stem-and-Leaf Plots

The standard histogram function is hist(x, ...) which plots a conventional histogram. More control is available via the extra parameters. The parameter probability=T gives a plot of unit area rather than cell counts, and nclass sets the number of bins.

Densities can be estimated via the function �����!�����3 :

&���!6�7'8&�!���� ���+*9��������!�!�4�:�;<*912�=��%���%��"�2����3�4 5(*>3����/��4��?'@;A*);(�); :20�0�2������!B'C=����! �6��3<')&�!6����� ��0 0�2������!B'C=����! �6��3<')&�!6����� �+*ED���"��&�4�:�;�0?*F����3G4IH�0See figure 2.
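A histogram on the probability scale can have a density estimate overlaid directly, since both then have unit area. This sketch uses a made-up sample x and assumes a graphics device is open; the bin count and line type are arbitrary choices:

```r
# Hypothetical sample
x <- c(62, 75, 81, 90, 95, 101, 104, 110, 118, 125, 133, 150, 171, 190)

# Unit-area histogram, so that a density curve is on the same scale
hist(x, nclass = 8, probability = T)

# density() returns a list with components x and y, which lines() accepts
d <- density(x)
lines(d, lty = 2)
```

If the curve is cut off at the top of the plot, the vertical range can be widened with ylim in the call to hist.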

Figure 2: A histogram of hstart with two density estimates overlaid.


A stem-and-leaf plot is an enhanced histogram:������������� ���������

��������� � ��������� ��� ! "#�%$&�' ���(�)��*!�+� �����,$-"/.102�3$��"#�

4 �!5+�6�)�,*87�9:� �!�;�,� � 7�*!�,5,�<��9=�!��=����>,!�?9�@A�!��B5!9,*,93�

$DCE$FDCE.,.� :GH CE$,$ H G!G�ICE.� ! :$!F HGDCJ�3.� :$ H,H G�3�DCE�,�3K�$!F�,�LCE��K�$,F���3.DCE�,.� �K�F,F,F!F H!H�H,H �!G!G� ICE�%�!�3.� �K,K!K�$!F H G,G�MKNCJ�3.!.,.� , , (K,K H G,G,G�3$DCE�,��� (K�$���3FDCE�%��$,G� H CEF,F� �ICE. H�3GDC H,H.,�DCE�%�� , ! �K,K�$,F!F H.%�LCO ,�.,.DCEF��.� ICJ�MK

Apart from giving a visual picture of the data, this gives more detail. The actual data, in sortedorder, is roughly

$!$10EF!.102F!.10EF� P02F!G102QMQMQand this can be read off the plot. Sometimes the

pattern of numbers (all odd?) gives clues. Quantiles can be computed (roughly) from the plot.

7.3 Boxplots

A boxplot is a way to look at the overall shape of a set of data. The central box shows the data between the quartiles, with the median represented by a line. ‘Whiskers’ go out to the extremes of the data, and very extreme points are shown by themselves. It is also possible to plot boxplots for groups side-by-side:

�A*:� R:�:���(SP�T��� 7:*!� S)��=R�9�U!7�*,93�P�M� 7�*,���V����9 �,���M� 0 5�S�5,*!�W����9��!���M�X��� 0 �����)�:� � ��9��!�, " �3R�R)�

divides a time-series into months, and plots the boxplots for each month on one plot. See figure 3. Other styles of boxplot are available; see the help page.


Figure 3: Boxplots for months of � ������� data.

8 Distributions

S has functions built in to compute (approximately) the density, cumulative distribution function and quantile function (the inverse of the CDF) for many standard distributions. There are also functions to simulate samples from these distributions. The first letter of the name indicates the function, e.g. dnorm, pnorm, qnorm, rnorm respectively.

Distributions available are:

Distribution        S name    parameters

beta                beta      shape1, shape2
binomial            binom     size, prob
Cauchy              cauchy    location, scale
chisquare           chisq     df
exponential         exp       rate
F                   f         df1, df2
gamma               gamma     shape
geometric           geom      prob
hypergeometric      hyper     m, n, k
log-normal          lnorm     meanlog, sdlog
logistic            logis     location, scale
negative binomial   nbinom    size, prob
normal              norm      mean, sd
normal range        nrange    size, sd
Poisson             pois      lambda
stable              stab      index, skewness
T                   t         df
uniform             unif      min, max
Weibull             weibull   shape, scale
Wilcoxon            wilcox    m, n


The function sample re-samples from a data vector, with or without replacement.
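The naming convention can be seen by applying the four prefixes to the normal distribution, with one check on the binomial. The particular arguments below are arbitrary illustrations:

```r
dnorm(0)             # density of the standard normal at 0, i.e. 1/sqrt(2*pi)
pnorm(1.96)          # CDF: probability of a value below 1.96
qnorm(0.975)         # quantile function: the inverse of pnorm
qnorm(pnorm(1.5))    # so this returns 1.5 again

z <- rnorm(5)        # five simulated standard normal values

# Binomial with size 5 and prob 0.5: P(X <= 2) = (1 + 5 + 10)/32
pbinom(2, 5, 0.5)

# sample() with no further arguments gives a random permutation
s <- sample(1:5)
```

The simulated values z and the permutation s will differ from run to run.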

8.1 Q-Q Plots

One of the best ways to compare the distribution of a sample x with a distribution is to use a Q-Q plot, of which the normal probability plot is the best-known example. Q-Q plots can also be used to compare two samples. For a sample x the quantile function is the inverse of the empirical CDF, that is

quantile(p) = min { z : proportion p of the data ≤ z }

The function qqplot(x, y, ...) plots the quantile functions of two samples x and y against each other, and so compares two samples. The function qqnorm(x) replaces one of the samples by a sample at the quantiles of a standard normal distribution. This idea can be applied quite generally. For example, to test a sample against a t_9 distribution, we use

> plot(qt(ppoints(x), 9), sort(x))

where ppoints computes the appropriate set of probabilities for the plot.

The function qqline helps assess how straight a qqnorm plot is by plotting a straight line through the upper and lower quartiles. (See the example in §3.)
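The construction behind these plots is simply sorted data against theoretical quantiles. A sketch with a made-up sample x, assuming a graphics device is open:

```r
# Hypothetical sample
x <- c(-1.8, -0.9, -0.6, -0.3, -0.1, 0.0, 0.2, 0.4, 0.7, 1.1, 1.6, 2.9)

# Normal probability plot with a reference line through the quartiles
qqnorm(x)
qqline(x)

# The same idea against a t distribution on 9 degrees of freedom
p <- ppoints(x)            # one probability per observation, strictly inside (0, 1)
plot(qt(p, 9), sort(x))
```

Any distribution with a quantile function can be substituted for qt in the last line.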


9 Classical Statistics

S-Plus 3.1 has a section on classical statistics. The same functions are used to perform testsand to calculate confidence intervals.

The table shows the amount of wear in a shoe experiment with 10 boys, an experiment reported in Box, Hunter & Hunter (1977), Statistics for Experimenters. There were two materials (A and B) that were randomly assigned to the left or right shoe.

����� � �

� ��� ������� ������������� �� ������� ����������� ���� ������ �!�"��������� ����#�$����� ������������% ����#&$����� �!�"��������' '� '������ '�(�)�����& � %������ ���������� ����#�$����� �!�"�������� ��#�$����� ����������*� ���#�$����� ����'������

We can use these data to illustrate one-sample and paired and unpaired two-sample tests. The rather voluminous output has been edited:

+-,/.1012*,43*56,�7*8/9;:1<(=!>,@?A:#BDC*EA<GF�C*EIH*HJLKMJONQPSRTJVUWP ENQKYX"PSR6X"PSXZQKMJ�JLPSRTJ E P\[] KMJ^UWPSNTJVUWP\R[QKMJ E P ] J*JLP\XJ*J�K`_QP\_-_QPaUJ/N"K`[QP\Z-[QPSXJ/Z"KbJ E PSXcJ*J�PSNJ ] K`[QP\N-XQPSXJ/["KbJ/N"PSNcJ/N"PS_RdJ�K+-8e?�?!8*7O.":^,/.I0D2*,�H+f? P ?!2*,@?A:#BQ<hg1iIC J EIH

j 9k2*5�,*8^gDlI=D2m?I5en!2�,e?o 8@?!8 K B?pC6E P\XdJOR ] < oq C [ <Glk5er!8�=�ik2fC-E PaU�N ] N8s=s?!2@t9d8@?d>Or!2b.Dul10?s.d2*,�>, K ?*tsid2bg�2*8/9v>,w9I0s?x2syeid8s=6?�0 J E[*Z lk2et!7�2/9�?z7�0�9 q > o 2/9d7�2->V9�?!2@t*r!8�= KXQP\X ] _sUDR ] JORQPSN�X*N�Z ] N

,*8�g1l1=1262�,e?d>�gL8@?!2*, Kg�2*8O9p0 q-{


���������

� � �������������������� �! �"# �%$ &'�(&�)���*"+�),��+�����&���-�)��. ��/0�2143����5�"� ��6 ��78� 6 3��:9# �%$;����<�-�= �6 ����> � ���������1@?2A8B �C� �

D > . ��FE �6 ����>����,� �G ���5H2I�/ . �"JK8����

H . . 9 �� �G �L��H2I�/ . ��J;�� . �� �NMOB ��* 1P�,B �C� 1PQLI�7 . 6 AL��B ����-�-����. 6 ��/�� . 7 �SR�T�Q2���R���� �09U�/�A��S?�A �V�8��,��W5A . 6 "� �C�

� � ���������1YX��

Z . �2H . /"HK[�="�2I Z . ?�Q 6 �V8I�[ ����

H . . 9 � . �2HKXOB\I �'�(����&�< 1]H��OB �C& 1UQ�I�7 . 6 A��^B �'�()L�C��-. 6 ��/�� . 7 �SR�T�Q2���R���� �09U�/�A���H ���8��/8�C����� �^?_� . ��� �N�8��O��W5A . 6 �� �<�- QL��/ ���C�"\�5�5��� H��C����� �" ��/�7 . 6 9I +���)�*�*"<�+�* �_�(<�+�*"<�+�*� . ?2Q 6 �F���� ? . ���09?`� . �O���;>N?`� . �a���FT

��������� ���`�(��*

� ���������1PX'1P7 . / � ��W5A . 6 B�b �

E � 6 ��R;c���H � �5HK[�="�2I Z . ?�Q 6 �4 I�[ ����

H . . 9 � . �2HKXOB\I �'�(����&�< 1]H��OB �C)'�(<�&�) 1dQ�I�7 . 6 A��^B �'�()L�C��-. 6 ��/�� . 7 �SR�T�Q2���R���� �09U�/�A���H ���8��/8�C����� �^?_� . ��� �N�8��O��W5A . 6 �� �<�- QL��/ ���C�"\�5�5��� H��C����� �" ��/�7 . 6 9I +���)�*�-���*"� �_�(<�+�-���*��� . ?2Q 6 �F���� ? . ���09?`� . �O���;>N?`� . �a���FT

��������� ���`�(��*

� � ���������1YX�1PQ . / �5H�B�[��

e . /8��HK8I�[ ����

H . . 9 � . �2HKXOB\I �'�(��*�&�< 1]H��OB < 1YQ�I�7 . 6 A���B ��������&�-. 6 ��/�� . 7 �SR�T�Q2���R���� �09U�/�A��S?`� . �\���OH ��� ��/ ��������� �N�8��O��W5A . 6 �� �<�- QL��/ ���C�"\�5�5��� H��C����� �" ��/�7 . 6 9I ������&���<�-���< I ���f�C������*"��������


���������� ����������������������� �"!#�$&%('�)

* �����+(� -,.� �0/(,��$�1�!��0, 2��������

$������43 �.�",�$5�� �0/(,��$�1�!��0, 26,#(!�78�+�9�-�#�-����-���(�6���"�(:;�(!&!��&�-����+,=<>%91+?���@�@ A&BC����1-D#��+E���%>F��GF�HI@KJ

The sample size is rather small, and one might wonder about the validity of the t-distribution. An alternative for a randomized experiment such as this is to base inference on the permutation distribution of d. Figure 4 shows that the agreement is very good. (As the computation of this figure uses some subtle ideas in S, it is omitted: see Venables & Ripley (1994, Chapter 5).)

Figure 4: Histogram and empirical CDF of the permutation distribution of the t-test in the shoes example. The density and CDF of t_9 are shown overlaid.

The list of classical tests is:

binom.test        chisq.test      cor.test
fisher.test       friedman.test   kruskal.test
mantelhaen.test   mcnemar.test    prop.test
t.test            var.test        wilcox.test

Many of these have alternative methods; for cor.test there are methods "pearson", "kendall" and "spearman".
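The relation between the paired and one-sample forms of t.test can be checked on a small example. The vectors a and b are made-up paired measurements, not the shoe data:

```r
# Hypothetical paired measurements on five subjects
a <- c(13.2, 8.2, 10.9, 14.3, 10.7)
b <- c(14.0, 8.8, 11.2, 14.2, 11.8)

# Paired two-sample test
r.paired <- t.test(a, b, paired = T)

# The same test expressed as a one-sample test on the differences
r.diff <- t.test(a - b)

r.paired$statistic
r.diff$statistic
r.paired$p.value

# Unpaired two-sample test, which ignores the pairing
r.unpaired <- t.test(a, b)
```

The paired test and the one-sample test on a - b give the same statistic; the unpaired test typically has much less power when subjects vary a lot.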


10 Handling Categorical Data

Consider a (fictitious) survey of shoppers in Britain. Amongst the variables collected for each person surveyed are sex, age, TV area¹, social class², transport used for this trip to the shops, and total spend at supermarkets. The possible values of these variables are

sex:        M, F
age:        –24, 25–44, 45–59, 60+
TV area:    1, ..., 12
social:     A, B, C1, C2
transport:  car, bus, cycle, foot
spend:      positive continuous

This provides examples of each of S’s types of categorical data structure. There are two main structures, categories and factors. The latter were introduced in the August 1991 release, and have almost entirely superseded the use of categories. A factor is regarded as a vector over the set of levels which have no implied order. Thus sex, TV area and transport are all factors. However, TV area is coded by number rather than by the names of the companies. These variables can be declared as

������ ���������������������� ��� � !"�# �$��� ���%�� %������������� "�# ���&��� � !�'�&��()�+*������,�� -�.�����.�/�0�1�'�&��(��2* �/���3�4� �+�.�'!

Internally in S levels are numbered in alphabetical order, and when factors are used as treat-ments in designed experiments, the order of levels may matter. For example, if we want tocontrast females with males (rather than vice versa) we need to specify the levels of the factorexplicitly:5 ����6�� -�.������/�0�����+7�8������:9<;���=.��;'�> �?�A@CBD@E9�@CFG@�!�!

Social class is an ordered factor in that the classes are perceived as ordered, with “A” (professionals) regarded as highest. We can declare an order by

�����'H���;%�� I�����&�/� �����$�����+�.�/�?������'H��;0���&��� �'!�!;��=���;'�G��������H���;�!J�� I;���=.��;'�E�������'H��;'!:KML7NPO+Q��R.�I�� ,�/���&�/�&�/���$�����+���/�?�C��R �S��� ��� � !:9

    levels=c("-24", "25-44", "45-59", "60+"))

The first line orders the levels by the default (alphabetical) order. The second shows how the set of levels may be changed, in this case by reversing the existing ordering. Age is an ordered category for which it is necessary to specify the levels explicitly. Had age been specified as a continuous variable, it could have been categorized using cut (whose help page gives other ways to produce the categories):

��R.�0�^�/� ������� I�]��S�C��R.�0�8�&��� ��9<�:�C[_9`T�W�9aL.W_9`Z�[�9bY�Y'!�!��R.�I�� ,�/���&�/�&�/���$�����+���/�?�C��R �S�4�/� �+�.�'!G9

    levels=c("-24", "25-44", "45-59", "60+"))

[c] Britain is covered by 12 commercial TV companies, so this provides a simple geographical variable.
[d] Derived from occupation.


Some of the functions for statistical models treat ordered factors in appropriate special ways.
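The declarations of this section can be tried out on a few made-up records. The sketch below is written in R, whose factor(), ordered() and cut() follow the S functions of the same names. The data values and the name age.years are illustrative assumptions, not the survey data, and the labels, right and ordered_result arguments of cut() are R's and may differ from the S-Plus 3.2 version.

```r
# Made-up records for illustration only (not the survey in the text).
sex <- factor(c("M", "F", "F", "M", "F"), levels = c("M", "F"))

# Ordered factor: levels are listed from lowest to highest, so "A" is highest.
social <- ordered(c("A", "C1", "B", "C2", "A"),
                  levels = c("C2", "C1", "B", "A"))

# Categorize a continuous age (hypothetical variable age.years).
# right = FALSE gives intervals [0,25), [25,45), [45,60), [60,99).
age.years <- c(19, 31, 47, 62, 25)
age <- cut(age.years, breaks = c(0, 25, 45, 60, 99), right = FALSE,
           labels = c("-24", "25-44", "45-59", "60+"),
           ordered_result = TRUE)

print(levels(sex))
print(table(age))
```

Listing the levels explicitly is what fixes their order; without the levels argument the order would be alphabetical.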

10.1 The Function tapply() and Ragged Arrays

To continue the previous example, suppose we want to summarize spend by some of the factors. To calculate the sample mean spend for each age-group we can now use the special function tapply():

> spend.means <- tapply(spend, age, mean)

giving a means vector with the components labeled by the levels

"! ��#�$&%('*)+#���$ !.�7�8 7�9�.�8�8 8�9�.�9�: ;�<>=7>? ' 7�< @/9 ' 9>@ @�@ ' 8�7 AB? ' ;�9

Suppose further we needed to calculate the standard errors of the mean spends. To do this we need to write an S function to calculate the standard error for any given vector. We discuss functions more fully in § 12, but since there is an inbuilt function var() to calculate the sample variance, such a function is a very simple one-liner, specified by the assignment:

> stderr <- function(x) sqrt(var(x)/length(x))

After this assignment, the standard errors are calculated by

> spend.stderr <- tapply(spend, age, stderr)

and the values calculated are then

! ��#�$/%(' ! �&%H#�E�E.�7�8 7�9�.�8�8 8�9�.�9�: ;�<>=@ ' ?/< 7 ' @�@ 8 ' 9�9 7 ' ?H<

The function tapply() can be used to handle more complicated indexing of a vector by multiple factors. For example, we might wish to split the spend by both age and sex:

> tapply(spend, list(age, sex), mean)

The combination of a vector and a labelling factor is an example of what is called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently by using arrays. The function apply() is the analogue of tapply() for arrays.

The pattern of our survey can be seen by the table function, which takes a listing of factors and returns the contingency table as an array, e.g.

> table(sex, age, TV.area, social, transport)
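The calculations of this section can be checked on a toy data set. This is a sketch in R, where tapply() behaves as described here; the six records are invented, and the helper is called std.err rather than stderr only because R already has a base function of that name.

```r
# Invented records for illustration only.
spend <- c(20, 30, 40, 50, 60, 10)
age <- factor(c("-24", "-24", "25-44", "25-44", "25-44", "60+"))
sex <- factor(c("M", "F", "M", "F", "M", "F"))

# Standard error of the mean of a vector.
std.err <- function(x) sqrt(var(x) / length(x))

# One value per level of age: the "ragged array" is the pair (spend, age).
spend.means <- tapply(spend, age, mean)
spend.stderr <- tapply(spend, age, std.err)

# Two factors give a two-way table; empty cells come back as NA.
by.both <- tapply(spend, list(age, sex), mean)

print(spend.means)
print(spend.stderr)
print(by.both)
```

The age group with a single record has no sample variance, so its standard error is NA; ragged arrays make such irregular cell sizes visible rather than hiding them.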


11 Loops and Conditional Execution

Commands may be grouped together in braces, {expr_1; ...; expr_m}. The value of the group is the result of the last expression in the group evaluated. Since such a group is also an expression it may, for example, be itself included in parentheses and used as part of an even larger expression, and so on. This facility is most often used with the control statements of this section.

The control statements are very close in spirit to those of the C programming language, and only a few are mentioned here. There is a conditional construction of the form

> if (expr_1) expr_2 else expr_3

where expr_1 must evaluate to a logical value and the result of the entire expression is then evident.

There is also a for-loop construction which has the form

> for (name in expr_1) expr_2

where name is a dummy, expr_1 is a vector expression (often a sequence like 1:20), and expr_2 is often a grouped expression with its sub-expressions written in terms of the dummy name. expr_2 is repeatedly evaluated as name ranges through the values in the vector result of expr_1. As an example, suppose ind is a vector of class indicators and we wish to produce separate plots of y versus x within classes. Use the help facility to understand the following:

> yc <- split(y, ind)
> xc <- split(x, ind)
> for (i in 1:length(yc)) {plot(xc[[i]], yc[[i]]);
      abline(lsfit(xc[[i]], yc[[i]]))}

(Note the function split() which produces a list of vectors got by splitting a larger vector according to the classes specified by a factor.)

Other looping facilities include the

> repeat expr

statement and the

> while (condition) expr

statement. The break statement can be used to terminate any loop abnormally, and next can be used to discontinue one particular cycle.

Loops in S are often memory-hungry, and care may be needed not to use up all of your computer’s memory. Expert advice is necessary on work-arounds.
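The control statements of this section can be exercised without any graphics. The following is a small sketch in R (which shares these constructs with S); the vectors x and ind and the names totals and n are made up for the illustration.

```r
# Made-up data: six values in two classes.
x <- c(1, 2, 3, 4, 5, 6)
ind <- factor(c("a", "b", "a", "b", "a", "b"))

# split() gives a list with one vector per class.
xc <- split(x, ind)

# for loop over the classes, with if/else used as an expression.
totals <- numeric(length(xc))
for (i in 1:length(xc)) {
    totals[i] <- if (length(xc[[i]]) > 0) sum(xc[[i]]) else NA
}

# repeat runs until break: find the smallest n with 2^n > 100.
n <- 0
repeat {
    n <- n + 1
    if (2^n > 100) break
}

print(totals)
print(n)
```

For a task as simple as the class totals, tapply(x, ind, sum) or sapply(xc, sum) avoids the explicit loop altogether, which is usually the cheaper route in S.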


12 Writing Your Own Functions

As we have seen informally in § 10.1, the S language allows the user to create his or her own functions. These are true S functions that are stored in a special internal form and may be used in further expressions and so on. In the process the language gains enormously in power, convenience and elegance. Most of the functions supplied as part of the S system are themselves written in S and thus do not differ materially from user-written functions. (However, increasingly such functions are being re-written as internal functions to gain efficiency.) Listing these functions (by printing their name without parentheses) is a very fruitful way to gain hints for writing your own functions.

A function is defined by an assignment of the form

> name <- function(arg_1, arg_2, ...) expression

The expression is an S expression (usually a grouped expression) that uses the arguments, arg_i, to calculate a value. The value of the expression is the value returned for the function. A call to the function then takes the form name(expr_1, expr_2, ...) and may occur anywhere a function call is legitimate.

For example, the iqr function in library(ripley) is defined as:

iqr <- function(x) {
    q <- quantile(x, c(0.25, 0.75))
    q[2] - q[1]
}

This first computes the quartiles, then returns the last value computed, their difference.

Note that any ordinary assignments done within the function are temporary and lost after exit from the function. Thus q is not left behind, and does not affect any other object q.
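The point about temporary assignments can be seen in a short sketch (R syntax, which agrees with S here). The function name spread and the object names r and w are hypothetical choices for the illustration.

```r
# A function whose body assigns to a local object called r.
spread <- function(x) {
    r <- range(x)    # local to this call; discarded on exit
    r[2] - r[1]      # value of the last expression is returned
}

# An object of the same name in the calling environment.
r <- "untouched"

w <- spread(c(3, 9, 4))

print(w)
print(r)    # still the character string assigned above
```

The assignment to r inside spread() neither survives the call nor overwrites the r defined outside it.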

If global and permanent assignments are intended within a function, then the ‘superassignment’ operator, ‘<<-’, can be used. See the help documentation for details, and see also the synchronize() function.

As a second example of a useful function, consider a function to evaluate the ‘Huber proposal2’ robust estimator(s) of location and/or scale:T�U V)W�X�Y*Z []\�U�^J_�`)a�bc^edgfAhji�k�l�monehjp$UGh@Y�hqa�^>a+`�p$U�krp>W�s�a�t+^edgf)u>hv`�b�w8k�l>myx$W [�z�u{

f�Z�[|f~} ��a�Y�m�^Jt>dIf)u+�^�Z�[�w$W/^.� `OT-dIf)ua+\�d�p"aOY�Y�a�^���d�p.U>u u {

p$U�x8Z�[�a�^Ca+`�p.U^"l�Z [�^J[�l

� W�w.Y�W {p$U�x8Z�[qp$Up$U"lrZ�[qp$U^"l�Z [�^

�a+\�d�p"aOY�Y�a�^���d�Y�u�u {


����������� ���������� � ���������������������

���� �����! �"�#�$�%&�'�)(��*�+�, � � !�-� ���+. (�/&�! 0�1��� ��� �2���+ �(3 ��&#�$�%1�'�)(��% � " � � �

�-������"4�657#8�9"4�:1;<�=�?>�����(@ A���CBD�E�:BF�?>�� . (3 !���G�5IH<�J�65��-�?57#?KL�=�M>N�-� �?> �������O>��P���-�E��QO#5IH<�J�65��-�?57#?KL�R�?��� �

�-�!�-�!�I>4�'�����-�!�S�M>:���-/1�G��QI# ����*�-�!��T�% � �R�-��Q , � � U�

�5IH<���R , ��J�M>M�2�S�M>:���*� � $ � !�����WV-V� , ��7�����!�M���*� � $ � ������

, % � &(�M>��!�-�S�M>:����������M��

� 5�� � �=�M>�X��?>��<BY�AX+�4�G��

This allows either of the location mu and scale s to be specified. Optional arguments are the parameter k, the initial value for mu and a convergence tolerance. The first line removes all missing values. The missing() function checks if a parameter is supplied. Two constants are then calculated as functions of k. The rest of the function is a loop. In general loops are inefficient in S and should be avoided if at all possible, but here we have no choice as the calculation is iterative. Finally the function returns two components, the location and scale.

It is sometimes useful to be able to time commands:

cputime <- function(x) sum(unix.time(x)[-3])
elapsed <- function(x) unix.time(x)[3]

which return the total cpu time and the elapsed time taken by a command or sequence of commands enclosed in {...}. Note: as these are functions, assignments inside them are in the frame of the function rather than permanent. Alternatively, use proc.time() before and after a group of commands.


13 Statistical Models

These facilities form the heart of the 1991 version of S. They are based on object-oriented extensions, so that generic functions know what to do with the results of various models. The two most basic notions are a data frame (§ 4.9) and a model formula.
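As a first taste of how a data frame and a model formula work together, here is a minimal sketch in R, whose lm(), coef(), resid() and fitted() follow the S functions of the same names. The data frame d and its five rows are invented for the illustration.

```r
# Invented data: a response y roughly linear in x.
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(3.1, 4.9, 7.2, 8.8, 11.0))

# The names in the formula refer to columns of the data frame.
fit <- lm(y ~ x, data = d)

# Generic ancillary functions extract information from the fit.
print(coef(fit))      # intercept and slope
print(resid(fit))     # residuals
print(fitted(fit))    # fitted values
```

The same fitted object can be passed to summary() or plot(), each of which picks the method appropriate to a linear-model fit.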

13.1 Model Formulas

A model formula couples a y-vector with a model expressed in a terminology very similar tothat of GLIM and GENSTAT. The form is� ������������� ������� ���� ����� �for the linear regression of

�������on

��� ������� �� and ����� � . Factors are replaced by a set of in-dicator variables for the regression, and can interact via the � operator (not � as this is a validcharacter in a variable name). Thus we can have all the following constructs:� � �"! � � � � � ��� � � �#�#� � ��! ����� � � � � ��� �%$&�#�#� � ��!������ equivalent to� � �"! � � � � � ��� �('��#�#� � ��! ������)� ���*���+�� �,�.-�� ��� ��/�0���0�0 �"� � nested layout� + � �1� � +#� ��2 � � �1�3�1�3� ��� parallel lines� 4�� � 4 � 5�6.� �#� � ���1�+ line thorough the origin� 4�� � 4 � � ���-87 �*� � �#���+89;:�< quadratic polynomial� 4�� � 4 � � �=7 �#� � ���1��+>9@?A9���������� 4 ����#B�C�< natural spline� 4�� � 4 �)�D7 �*� � ������+�< smooth function, for EFHGThe syntax of a linear-model fit is

� ! 7 model formula 9 data frame <where the names in the model formula refer to columns of the data frame, which can be omittedif it has already been attached. For example� � � 0 � � � -87 ����� � � - <� � �� �4��I7 � 200 ���#<� � - �*� �KJL� !,M 5(� ! 7N��#��O�.��� �� � ����� � <�)�12 !�! � � -87 � - ��� �PJQ� !3<� � � ��R��P7 � - �#� �PJL� !S<� 4�� ��TT3� 4 �U���� �D7 � - ��� �KJV� !S<� � �� � 7 T �������W� 7 � - �*� �KJQ� !=<P9X�*� � ��� 7 � - �#� �KJV� !S<�<This show how to extract information from a fit by the use of ancillary functions. There are nostandard ancillary functions for standardized and Studentized residuals, but I have added themas� �����#� �=7 < and

� � 2 ��#� �=7 < in� � 0 � � � -I7 ����� � � - < .


13.2 One-way Layouts

The analysis of a one-way layout is best illustrated by an example. The table gives data on observed concentrations (ng/ml) of a chemical in groups of 10 patients after oral administration of almitrine bismesylate:

�������������� ������

������������ ��� ��� ����� �����

� �! "�� ����# ����"� !$# �%��� �%&'� �% (� ��� )'� �%&(� �%)�)! !$" �%��� ����� ��"��� �(� )(� � � ���!# ��� "�� ��"�) �%)�)& �� "�� ����" �%)�)) ��" )(� �*!+� �,&(�" �%& ����� �*!�& ������%� �(� "�" ��"�# ��"�#

- ���(�/.10�243���5$�/�'6���57 98$�4/:����; 9.$<��= >8'���Function to compute st. dev.- ��?$��@A6���<�B40�21���<�57 ,C��/?+�,@�6���<�BEDF�(</��C��

- �(���G0�2H�(�/I; J�K L���=MN���OMP�����OMQ������RMP�����Label the observations by dose- ��������I10�243$<��/�$���= S�������Make a factor from the doses- �+�/8�I$B����7 *,I$B�6��= J��?��,@A6���<�BKMT�(������

- �+</I�I$B/U; J��?$�,@�6���<�B=MT������OM�@'��<�5��- �+</I�I$B/U; J��?$�,@�6���<�B=MT������OMV���(�/.��- ��?$��@A40�2G��<��$<EDW3�(<,@��X Y�������IZM[��?$�,@X6���<�B�

set up for AOV- ��?$��@AOD�<���.�0�2G<��/.; J��?$�,@X6���<�BH\]�������I^MN��?$�,@X��- ���@�@�<%��U7 L�/?$�*@XKDY<��,.��

print out table- ������3�3�6��6��/5��$_ S�/?+�*@AOD`<��/.$�and the parameters- ��?$��@AOD�<���.�0�2G<��/.; JB��/�7 L�/?+�*@A6���<�B�a\]��(�/��I^MQ�/?+�,@A��and on log scale- ���@�@�<%��U7 L�/?$�*@XKDY<��,.��

- ���@�@�<%��U7 L<��/.= LB��/�; S�/?$�*@X6���<�B��b\cB��/�; ��(�����/d������,��I7Me�/?+�,@X����

test for linearity of response

which gives

fagLhji�i+kml�nAoqpLrsSi$gXt`kju�vwx/yaz hji�u yaz/{ | s/kL} z/{ ~�� kj�jhs ��lAo ~ w

� l%ujh/� � �/���/�����OtF�G�/�L�/�����OtF���/�Rt����L�/���H�+t������/���J��s��j���� s/g��m�jh�kj��gP�/� �S���/�����KtF� �(�L���OtF�fap,u�s y�y ��p%��sL}%��g(oqpLrsSi$gXt`kju�vwoS�L}%��s*l�p�sJ����w � l%ujh/��� � l%ujh/��� � l%ujh/���

�J���KtF���/� �/�Rt������/�=tS�/�L�/�����%�RtF�/���/���fagLhji�i+kml�nAoqpLrsSi$gXt`kju�vw

x/yaz hji�u yaz/{ | s/kL} z/{]~P� k,�*h�s �/lXo ~ w


��������� ���� ����������������������������������������������� "!�#$�%�& !�'�(*)���+�,"'-�� ���������������.� ������/ '%��0�0�+*��1324+���5627,��8�3249 :�!.0;(89�+�,�<>=-,��8�32?)��"'�!�<A@������A���CBD9 :�!.0E'�<�<F�GIH ��0J� GIH�K L !�+%M H�K N�O +�,��P! Q��32 N <,����627)��"'�!�< � P��������� ���P��������� ���P�%������� �-���������������������������� �������8���8� ���������� ���.� � ���������%���������& !�'�(*)���+�,"'-�� ����������� ���R�%����

The parameterization of linear models for designed experiments is a little tricky. The usual parameterization is to impose a ‘sum to zero’ constraint on the parameters for a factor. GLIM sets the parameter for the first level to zero, so that parameters for the other levels are differences between that level and the first. By default S uses the Helmert parameterization, which compares the second and subsequent levels to the average of lower levels. The usual parameterization can be gotten as default by setting

> options(contrasts=c("contr.sum", "contr.poly"))

and the GLIM parameterization by

> options(contrasts=c("contr.treatment", "contr.poly"))

Of course, the parameterization only affects the coefficients, not the fitted values, residuals, .... The contrasts for a particular term in a fit can be changed by the C() function, e.g. C(group, sum) or using contrasts.
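The three parameterizations can be compared by printing their contrast matrices for a three-level factor. This is a sketch in R, which provides contr.helmert(), contr.sum() and contr.treatment() under the same names; note that R's default is the treatment (GLIM) coding rather than Helmert, so the option is set and restored explicitly here. The object names helm, zero, glim, old and current are illustrative.

```r
# Contrast matrices for a factor with three levels (rows = levels).
helm <- contr.helmert(3)     # each level against the average of lower levels
zero <- contr.sum(3)         # 'sum to zero' constraint
glim <- contr.treatment(3)   # first level set to zero, as in GLIM

print(helm)
print(zero)
print(glim)

# Change the default for unordered and ordered factors, then restore it.
old <- options(contrasts = c("contr.sum", "contr.poly"))
current <- getOption("contrasts")
options(old)
```

The Helmert and sum-to-zero columns each add up to zero, whereas the treatment coding has a row of zeros for the baseline level; this is why the coefficients differ while the fitted values do not.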

There is a ‘clever’ way to test for linearity using a re-parameterization of the factor group as an ordered factor, for which the default parameterization is polynomial in 1, ..., (number of levels). (This relies on log(dose) having levels in an arithmetic progression. One could always use poly(log(dose), 3) in place of ldose.)

> ldose <- ordered(factor(log(dose)))
> summary.lm(aov(log(chemical) ~ ldose, chems))

(As far as I can see the use of summary.lm is necessary to get results for the individual coefficients.) This shows that the response can be regarded as quadratic in log(dose):

/ '%��0�0�+*��1C��,%0C24+��85627,����6249 :�!R0X(A9�+�,_<>=>,�)��"'�!XB�9%:P!.0E'�<�<� +�,�,��D+���562 G �8�*0"�$,�+-��,��8�62.9%:P!.0X(�9�+�,_<-=�,�)���'�!XB�)"+A�$+-��9%:P!.0�'�<& !�'�(*)���+�,"'3�L (4M �A� L !�)_(�+%M "� L +*�#�������������#�����P�%���I#��������P�%�� ����������������������P�

� ��! G�G (89�(8! M��_'3� O +�,���! H ��)������������ �I5_+�,��P!>Q��32 /�� � � <24� M��$!A�_9�!%����< �C��������� ����������� ����m������ �����������

,�)��"'�!6�m� �E�m���� �� ���.� ���% �.���������� �����������,�)��"'�!6��� #��������8� ���.� ���% #����P�%��� ����������,�)��"'�!6� � ��������� ���.� ���% ��������� ���������

& !�'�(*)���+�,�'A�$+%M")"+A��)�!A���������������� � ��AM���I)�!A���$!�!�'-� G�G �_!�!�)�� 0L �",8�P(4�$,�! & # H�K ��+A�$!�)����������P�%�


����������������� ����������������������� ��"!# �$&%"'�$"$"�(��)�)�'*$"$� "�,+.-��/$10�&2*��3�4$#��657�98"8��*$"��5�8

:"��'"'�$�3��&�&���;��)6:"�<$�)")�� =�$��=�*�>�?A@ �=�*$&'* "$,0<��BC3� "����$��ED#3" ����"$>�GF

3" �����$��HD I3" �����$��GF I I3" �����$���: I I I

13.3 Designed Experiments

The central concept for designed experiments is a factor. Consider the famous Box-Cox poisons data (survival times (in hours) of animals with 3 poisons and 4 antidotes, from Box & Cox (1964), J. Roy. Statist. Soc. B26, 211–252 and Box, Hunter & Hunter (1977), Statistics for Experimenters). The function fac.design generates the rows, columns and so on; consult its help page for full details.

����U+V$"��W��X�� "�,� ?�Y 0���������Z�9 ���� Y B

data in hours)����[+V$"��W���3���&� ? ��'*$����<\�D"]�^"^"]�_<`ba[57�E�"c�-d'*$,0�3�\57�E�e-f0��*�������\� ?�Y�@�Y - Y�@"@�Y - Y�@"@�@�Y B"B0�����������W"�g <�������H)�'*�h+7$ ? )*�� ��� <$"�=�%�� ? ? �b-��b-i�*BV-E)���h+7$��<B-U�&�U+V$"��B0�&' ? +*)"'=��j=\� ? �k-l�*B"B0�3������� �$��<,%�� ? 0����������=B

plot main effects0�3������� �$��<,%�� ? 0����������m-d)�4"��\�+V$� *����nBand using medians�����*�� �/ ? 0����������=B

0�3������H)*�� ��=��' ? ����U+7$��go1�"'*$����#pq0��*�������-i ��&�*��\�0����������=Bbox plotsA�=�*$&'*�" &�&���Z�r0�3���� ? �"'�$"���b-s0���������t-u�&�v+7$"�=B

A�=�*$&'*�" &�&���Z�r0�3���� ? �"'�$"���b-s0���������t-u�&�v+7$"�m-w)�4���\,+7$� ���,�VB0����������>�G����2#W"�#����2 ? �&�v+7$"�xo1�"'�$"���6y(0��*������nB

full fit),�*��W��g)����*$� ? ����2 ? ����U+V$"�go1�"'�$"���;px0��*������nB"Badditive fit for 1dofna��4�+"+V��'"z ? 0��*����������G����2�B

0�&' ? +*)"'=��j=\� ? �k-l�*B"B/V���� ? '*$��<& ? 0��*����������i����2�B"B{"{ ����'�+ ? '*$��<� ? 0��*����������G����2�B"B0�3���� ? ),�"��$� ? 0����������>�G����2BV-�'<$"�=� ? 0��<����������G����2B"B��4�+"+V��'"z ? ����2 ? ����U+7$��go1�"'*$����#pq0��*������6p�)���*�=|���p���'*$"�&�t�r0��*������VB"B

which gives

} ��4�+"+7�&'"z ? 0����������>�G����2B~")6`�4�+���)�` { � $"�,��` { ������3�4�$ �"' ? ��B

�"'�$"�&� � �"�b�[5,�"I"!g�"Ib����I"!"����5��b����I"8"8���Ie�9I"I�I"I"I��"�0��������� � 5�I"�b����I5��g85V��!�8"I"!����"�b�����5����#Ie�9I"I�I"I"I�I"��"'�$"�&�t��0��*������ ! �"8b��I�5��"� ���[5,!"�"��! 5V�������<����Ie�h5"5,�"�"8�I"!_*$��<& �4��3�� �"! �"Ib��I��"�"8 �b�������<���


[Figure 5 panels: mean of stimes and median of stimes against the factors treat, repl and poison; box plots of stimes by treat and by poison; interaction plots of the mean and median of stimes against treat for each poison; normal Q-Q plot of resid(poisons.aov); resid(poisons.aov) against fitted(poisons.aov); log-likelihood against lambda with a 95% limit.]

Figure 5: Plots for Poison data


�������������� ������� ����������������������� "!#�#�$����%& ('��)���+*�,( "��������.-/!#�#�$����%10�02 '&3����4��'53�6 7#���)%43�6 8(9#��:��;� <�� =8#0

��#����� > ?�,A@�B),�C�D�>�CA@FE�C�D�G�G4B�>A@FG�C�H�H�G(CI@JC�C�C�C�C�>�G!#�#�$����% , B�C�>A@F>�C�B�,�H�B1@FD�H�C�D�,(,�>A@F,�,�B�E�K�CI@JC�C�C�C�C�C�>L M'��)���N*�,�0 B B�HA@F>�E�,$KOB�HA@F>�E�,$K+, DA@F?;B�B�>�,(CI@JC�B),�H�B)H�G��#�����.-P!Q��������% H ?I@JD$K1B)> B1@F?�,�G�,�E CA@FG�D�D�?�>(CI@JH�B),�E�>�,�HR ���N��S�����:Q� >�D G�CA@FC�E�,�H ,A@F,�,$KN,�K

indicating the need for transformation. The I(...) function protects the argument from expansion; (treat+poison)^2 is equivalent to treat+poison+treat:poison and generally (factors)^n gives up to n-th order interactions.

There is no direct Box-Cox function, but we can do the operations by hand. They are quite slow (25 secs on a SparcStation IPC), due to the overhead of calling the aov function:

n :5o�p(����6q �p�,sr�Bdrut+�Nv�CI@�B�0:���wN:#��x5o�py����@z����{��N��� n :�0%ko�p�:N��%+w��$|s ������=�1���+0%#:�%Nw��}o�p~:���w� F!NN��Sq �����������N0�0'N��� )�(��%�B-J:Q�)%Nw���|A n :�0�0��

��'� ��)t��1 n :A�=���;0"��CA@FC�B�0����(o�py�)���� � ����$�� ����;�=�1���N* n :A�=��������������� "!#�#�$����%10�������N��S�0�*�,�0:��$w+:��)x.�=�)�5o�p�%���:��$w� ���t;�1 n :I�����;0�0�p�%���,Q��:��$w� ����+0� � n :I�����#p#B$0���%Q:�%Nw���

��:N�������(o�py�)���� � ����$�� =:���w� �����������N0�����#������ ~!Q��������%d0$������+��S#0�*�,#0:��$w+:��)x.�=�)�5o�p~p�%;��,N��:���w� ����N0�p�%#:�%Nw���

�!#:��$�� n :�r�:���wN:#��xIr n :Q�)t&vk���#���QtQSQ�Q�dr��N:Q�)tOv4���+�$w����)x���:#��|Q����S;�dr�����!���vk��:;�$0:Q���QtQSQ��|;���5o�p�:���wN:#��x.�J:���wN:��)x(v�v��� n =:��$wN:#��x;0��:����q�)�5o�p~:Q���QtQSQ��|;����p~CI@FH5��6N{�|d�$��6m =CA@F?�H�r�B$0��tQ:���%��1 �:����q���Ar�C#0��{���:yo�p� g!���� �������1��0q�uK��yp�!;��� ����;��1��0q�J>$��0���!;��� ���!d��%d�$0m�F,$���� n �� �{1 n :A��B��;0r�:����q���� yCA@�B�����{���:�r���?�H����$0

A more efficient way (4 secs) is to use the function boxcox in the library ripley:

� ��_���Y�[+YQ��U�Y1_�]���Z��mW� �d^�����^���U�`$Xm_��1Z�`� �X�Y�Z�[�Xk\4]^;_�`N^�aqW

Now consider a Latin square. Six litters of six piglets were ranked in order of birthweight, providing a 6 × 6 table, and each piglet was given one of 6 dietary supplements in a Latin square. The weight gain (in kg) over 12 weeks is given in the table.


������������ ���������������������! �"# ���#�"����#��� �����""! $���!#%��!#$"��! %���"��!#��

�'&���(��)�*�+��� ����������,.-0/21�/�-43657,�-48�3�9.-0,�:�/�-4321�1;-4:�<,.-08�3�,�-4/�1/�-4:�:�9.-0,�<�9�-4/�<�9�-4=�,:.->5?:�,�-4:�/�,�-@1A5B/.-C1)8�/�-@1�1!9�-@1)<<.-0,�:�,�-43657,�-4965D1E-0<�8�/�-49�<�,�-@1�19.-0=�,�:�-*5�9�9�-43�/�,.-0:21�9�-4/F5G,�-4/�/1E-08�,�9�-4<�,�,�-4,�9�/.-0,�=�/�-4=21!9�-43�3

�����������$HI���J�)K2LM�N�I�2���6���O��J�F���P��'�)����� - H�L��*QR�R�SH���� - �)�� )��(2�T���R� 9 � 9 �U�VO��� ��W�4X)LI�?�)Y[Z 5R\49 �]O��������JL)Z 5U\49 ���R�^ �I�2�J�.�_&���(��[���A��G`IO�K2� - ���� )�?(����aO����6���A�� � ������� �N�I�2�����_��LI���J��QU���)�6���O��J�F��� - �bK2c������K�cW�d&���(I�)���PeBX)LI�?�)Y ^ OI�?������L ^ � �2�J�.�fO)���F�*�R��� ?gbQ�QU�JL�hM�NO��J�F�*� - �bK2c6��� ?gbQ�QU�JL�h - O?Q;�NO)���F�*� - ��K2c6�

The last command gives t-values for the contrasts (diet ? − diet A).

�� ?gbQ�QU�JL�hM�NO��J�F�*� - �bK2c6�� H�jbgbQkK2H�j�l m����?�!j�l #�n �bObg6� o�LM� # �

X)LI�?�[Y , /.-08�321[=�,+5R-4,�:21):�=�8�3�-4/�:�:�9�:�/'=.-0=21),�1),�9�<�<O������I��L , /.-0/�3�=�1R5�5R-4,�1�1)=�:�<�3�-4/65?/�=�3�<'=.-0=21)8�9�3�1)/�/� ����� , 5�5U-09F5?/�,F5B3�-4<�3�<�,�=�<$1;-4=�:�:�,65?:'=.-0=F5?=65?,�1)8�=p �� )�J�bgF�bO� 3�= 5�5U-0<�9�,�8�8�=�-4,�9�:�3�8�8�� ?gbQ�QU�JL�h - O?Q;�NO)���F�*� - ��K2c6� �bO�O \ ��K�cW�dH[K2LJQ�gIO)�'Z$&���(��[���keBX[L��?�[Y ^ O��������JL ^ � �����.�q�����I�'Z�O��J�F���A�p �� )�J�bgF�bO� \

mF�*� 5br mI�������?� <)r mI��s� 3.-0=�,F5 � =.-03�8�=�9�=�-*5?365�5B=�-4<�/F5?,�=.-48�=�965

K)��H�HF�2�[�2���)�� \n �bObgF�7j2�)� -t� L�L)K�L ��c��bObg6�Go�LM����uv�wuS�

��x��)�I��L����?`[�F� ,�-49�=�,�= =�-4<�=�/�: 5�:.-4365?3�3 =.-4=�=�=�=-�-�-�-�-�-�-�-�-�-�-�-�-�-�-

� ����� " =�-@1[9F5?/ =�-@1[<�,�3 5U-4=�9�=�/ =.-4<�=F5�,� ����� =�-@1[=�<�< =�-@1[<�,�3 =.-48�3�9�/ =.-4<�9�,65� ����� � =�-4<�,�,�= =�-@1[<�,�3 =.-4:65?,�9 =.-@1[321[<� ����� � =�-48�/�=�= =�-@1[<�,�3 3.-43�3�:�/ =.-4=�<�/�,� ����� # 5R-4/�,�:�< =�-@1[<�,�3 1E-4=�<�8�8 =.-4=�=�=�9

-�-�-�-�-�-�-�-�-�-�-�-�-�-


13.4 Generalized Linear Models

The functions lm and aov have extensions glm, which fits generalized linear models, and gam, which further extends this to allow semi-parametric smooth functions in the explanatory variables. We can, for example, fit the poisons data by a gamma GLM:

��� ��������� ��������� ���� ������������������! �"#� ���$�%�&����(')�#*+�,�'���.-/��)�������$021���3����4�5�6��&�������

note the ‘G’��7����(�8,�49�:����������(�;������3��������;�:��)���������;�<�����=� analysis of deviance table

Once again there is a whole range of ancillary functions such as deviance, predict and residuals. The latter will produce four types of residuals, but uses deviance residuals by default.

The family argument is also used to specify other aspects of the fit such as the link function. For example, one can have family=binomial(link=probit). With the binomial the response can either be a factor (taken as first level vs the rest) or a matrix with two columns giving the number of successes and failures. There is a quasi family allowing user-defined models, and a robust family generator allowing robust fitting. The scope for ingenuity is unlimited!
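The two-column form of a binomial response and the choice of link can be sketched as follows. The code is in R, whose glm() and binomial() follow the S functions described here; the counts below are invented for the illustration and are not the rotifer data, and the names dens, fell, total and fit are hypothetical.

```r
# Invented grouped binary data: 'fell' successes out of 'total' trials.
dens <- c(1.02, 1.03, 1.04, 1.05, 1.06)
fell <- c(10, 20, 30, 40, 45)
total <- rep(50, 5)

# Response as a two-column matrix: successes, failures.
fit <- glm(cbind(fell, total - fell) ~ dens,
           family = binomial(link = "probit"))

print(coef(fit))
print(deviance(fit))
print(residuals(fit))    # deviance residuals by default
```

Replacing "probit" by "logit" (the binomial default) refits the same data on the logistic scale, so the two links can be compared through their deviances.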

Binary Data

The following example is taken from D. Collett (1991) Modelling Binary Data, page 217. Numbers of rotifers falling out of suspension for two species (Polyarthra major and Keratella cochlearis) are given for different fluid densities in the table, as file rotifer.dat:

E�F�G8H�IAJ�KML%N=OPKQL�N=OPJSRTJQU�V�OPKQU�V�OWJ�RTJX�OZY�XA[ X�X \�] XA^_XA`�XX�OZY%a�Y b ]�` X�cdaTcS]X�OZY%a�X XAY b�` ^�Yea�^fcX�OZY%^�Y XA[ ]�^ XAYea�]%^X�OZY%^�Y [ \�` X�cgXAa%[X�OZY%^�Y a�X b�^ ^�\_XA`�XX�OZY%^�X XA^ a�[ a�`_XA`%bX�OZYfc�Y ^Tc c�c ^�aea�]%`X�OZYfc�Y XAY ^�X a�a_X�XhbX�OZYfc�X ^�` \�` a�^_XA`%aX�OZYfc�] a�Y a�b b cSaX�OZYfc�[ \Tc \�[ a�a cS]X�OZY%\�Y a�Y a�a [ cS[X�OZY%\�Y [ X�c ^TcgXA`%YX�OZY%`�Y X�c XAb b�X bfcX�OZY%`�X XAY a�a a�\ cS\X�OZY%`�^ `Tc `�` [TcgXAY�XX�OZY%b�Y `�] ]�` `�^ `%]X�OZY%b�YQc�]�]Qc�[�a_XAb�]_XA[%YX�OZY%b�Y ]�] ]�[_XA\TcgXA\fc

An annotated session follows. Several points need further explanation.


The parametrizations need careful consideration. By default S uses a linear-model parameterization, contrasting each level with the average of the previous levels. This is less useful for GLMs. The first way out below is to remove the overall mean (the -1 term) which forces separate means for each species. We can also change to the GLIM parameterization by the options line.

There is a catch here. By default factor numbers the factor levels in alphabetical order, so we have to force the order we want (see § 10).

[Figure 6 plot: pm.prop (0.0 to 1.0) against density (1.02 to 1.07).]

Figure 6: Plots for Rotifer data. The square symbols and dashed line indicate species Polyarthra major.

������ �"!#��$&%�'(�"!�)+*-,.�")0/212!43#56���7�8�0�"!��9,:*2)#�454;=<8!�)�*!#��>7?�@������ �"!#��$ list the data frame)#���")�A0<B3C������0��!#��$@D AE,GF���+FH%�' D AI,KJ�L D AI,K���7� compute the proportionsF�MN,GF���+FH%�'OF+MP,KJ�L0F+MP,K���7�

and plot themF�1��7�E3Q*2! R�$��0��JB;SF+MP,TF��#F9;U��J7F8!�>�5QRV5�;WJ�1"�QMX>YAZ3\[];6^7@�@F����6R��"$Z3\*2! R�$� ��JB;SF+MP,GF���#F9;WF8A0<�>�["@F����6R��"$Z3\*2! R�$� ��JB; D AI,GF���#FZ@

fit separate models for each species_1 MP,TF�M`%�'a_�10Mb36A0/4�6R2*�3cF+MP,.J-;=F�MN,K���7�"' F�MN,KJ�@edf*2! R�$� ��JB;W/Z�gR��0MV�7)+1�3Q1��7_8�0��@�@_1 MP, D A&%�'a_�10Mb36A0/4�6R2*�3 D AI,.J-; D AE,K���7�"' D AE,KJ�@edf*2! R�$� ��JB;W/Z�gR��0MV�7)+1�3Q1��7_8�0��@�@_1 MP,TF�Mihj_10MN, D A bare summaries

Now combine the two species$0F8!�A��7!�$e%�'f��)�A#���7�E36AZ3C�"! FB3#5QF�ME54;Ck�["@Z;l�"!0F]3#5 D A�5�;Ck�["@�@Z;12!�m"!+12$�>nAZ3#5QF�ME54;o5 D A�5�@�@

������ �"!#��kp%�'q*2)#��)I,.���")rM�!43Q*2! R�$(>YAZ3Q*!0R8$�0��J-;j*2! R�$� ��J8@�;J"!�$e>HA43cF+MP,KJB; D AE,KJ8@�;U���7�n>YAZ3sF�MN,K����-; D AE,K�����@Z;t$0F8!�A��!�$�@

)#���")�A0<B3C������0��!#��k"@_1 MP,.��7�u%�'f_�10M93gA0/4�6R�*V3vJ�!�$�;w����"'�J"!�$�@&dx*2! R�$&yn$0F8!�A��7!�$V;W/Z�gR�� M��7)+1�3\1��7_8�0�8@�@


����������� Note the parameterization used����������� ����������������������� ��!�"$#%�&'�(�'!�"�)+*+��,�-��.!��/"+01"�2/!�����!3"$#4�5���.�����6&���7�3��/��/)3)����������� separate means for each species82�/�(8�/"����88��3�'63"9'"8:.���9;8�88����<�=3�'!�6(9�/!���5;/#>;8�&9��3���?2.��&�5;�)3)����������� ����������������������� ��!�"$#%�&'�(�'!�"�)+*@�.!��/"�03"�2A!����&!�"�#B��������$�&6&�$�C��&�/��A)�)�����������"�D8���569���E� ����������&/)6��.�F�65� �����G�=��&/) over-dispersion, but a common slope

looks OK����������� ����������������������� ��!�"$#%�&'�(�'!�"�)+*@�.!��/"@-H"�2A!�����!�"�#I�5������$��68�$�C����A��A)�)�'���/!3"5�7��!��A"���3�J#%K/���'!8�$�L�����G�=���/)�M "�2/!�����!3"&:�:A;7N'��;�O/)�'���/!3"5�7��!��A"���3�J#%K/���'!8�$�L�����G�=���/)�M "�2/!�����!3"&:�:A; 2&�P;�OJ#%�&���:3Q')

these lines are rather crude, so try harder!R���!��S�3�1"�!&T��9,5�VU�WP#X,Y�ZU�[P#\UJ�ZU�U/,&)����!��S�3��2��'!8�'�&�9P�L�������=���<#]��69'6E�=K3�'6��5!5�C�.!��A"&:&�'!�2^� R��.!���#_W')Y#

"�2A!�����!�"8:�K�6��9���E���5�L�'!�2<�9;72&�E;�#%`/,&)Y#]��!�2^�9;�N���;/#\`/,&)�)Y#�.!(F'!8�."&:1�5�9;72&�E;�#X;�N'��;&)�)3)Y#]3��2/!8:/;��'!�"�2�9�/"�!.;�)

�'���/!3"5� R��.!���#%���.!��aMb,YcV`A,�O<#\��3��:�Q�)�'���/!3"5� R��.!���#%���.!��aMZ`�WJcb,�U3W�OJ#d��3��:3Q')

Poisson Data

We consider the log-linear analysis of a contingency table. As this has two ‘history’ factors and two levels of the response, it could also be treated as binomial data. The response is the occurrence of coronary heart disease. The table is of the form:

                  blood pressure
       serum
chd    cholesterol     1     2     3     4

yes        1           2     3     3     4
           2           3     2     1     3
           3           8    11     6     6
           4           7    12    11    11

no         1         117   121    47    22
           2          85    98    43    20
           3         119   209    68    43
           4          67    99    46    33

with log-linear analysis:e�f.gih'jlk�m�n�eao9pqsrtrvuwrtqixyrlzix'xH{t{s|}x�q~x�xlx'xx�x�|ix�q�x�uA|�q�qSz/�l��zSu/rtq��wx'x���q�����{�zHu/rt{�|����Hu5{sr'r

� eYn&g$�Akyh�j��/��k��aoC�/�/�/k�k3� xP�_uG� k.���.f�g�� x<�_uG� m��A����m<o&�9�E� � �9e��.p�p��� h�jS�/n3�Yn�� � �/n�g5�<o � n�m��V���Ak�����eao�mPo uG�Vu���q p �V� e5n9g$�Ak�p � e�f�gPp

������������� ������������������������������ ���!����"�#�$&%('�)�+*��-, .��0/ *�����/-�1%2$ )-3�)�.�����4)-��/-5�)6���!�7�8� ��7%930���:3�.<;=�#<*�;�4����������?>@��A��� $ )-3��6�B���7�C����D% �E�F�B�����������HG8���!� ���6GB"# $04�0)��&�I��'��J/-K .!"6�L><%NM�4�4PO(�0�!/-31�B�������0�Q�6>�4

The anova command gives an analysis of deviance for glm objects:

RTSVU�W:X!SJYIZ-Z&[C\�]V^1_a`�b�cQ`�d egfLh0i�e:jk U�S]l!c�icmW:npo!bQX�i:SgU q�bsr!SVt�]�b

u W�i:c-cWLUv^ W�w�b]

x b-cVy�WLU c-b{z|U-}^

r!b�~Q^0c�Sw-w�bwTc�b�Q} bgU�` iS]-]:l�Y�n iV~�cQ`�`�W�]�S-cQ`�jo-n�o!b�X i:SgU q-b x b-c�iQw6[ao�n x b�c�i�wP[ao�bQX u ~{YBfQh0i-j

������ � � �V���� [����-�c-bQ~}^ � ���P[ � �-� ��� �V�-��� [�� �-� �6[��-�����-���-�y�~!b-c�c � ��� �P[����-� � � � � � �6[ � �-�E�6[��-�����-���-�qVh�w �����g��� [ � � � � � �-�P[ �-� ���6[��-�����-���-�

c�b�~:}L^�z�y�~!b-c�c � � � [ ����� �V� �� [ ����� �6[��-� �������c�b�~:}^DzIqgh�w � � �P[ ��� � � � � � [�� ��� �6[��-�����-� �-�y�~�b�c�c+zIqgh�w � �V� [���� � � � [��-� � �6[��-����� � �-�

c�b�~:}L^�z�y�~!b-c�c{z�qVh�w � � [���� � � �P[��-�����6[�� ���:� ���-�
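As a hedged sketch of the analysis above, the counts in the table can be entered and a log-linear model fitted along the following lines. The object name chd.df and the layout built with expand.grid are assumptions made for illustration; the factor names serum, press and chd follow the terms that appear to be listed in the analysis of deviance table, and the saturated model is used only because that table seems to run up to the three-way interaction.

```r
# Cell counts entered row by row from the table:
# blood pressure varies fastest, then serum cholesterol, then chd.
counts <- c(  2,   3,  3,  4,    3,  2,  1,  3,
              8,  11,  6,  6,    7, 12, 11, 11,
            117, 121, 47, 22,   85, 98, 43, 20,
            119, 209, 68, 43,   67, 99, 46, 33)

# Factors laid out in the same order (illustrative names).
chd.df <- expand.grid(press = factor(1:4), serum = factor(1:4),
                      chd = factor(c("yes", "no")))
chd.df$counts <- counts

# Poisson log-linear model with all interactions (the saturated model).
chd.fit <- glm(counts ~ serum * press * chd, family = poisson, data = chd.df)

# Sequential analysis of deviance, one line per term.
anova(chd.fit, test = "Chisq")
```

Since chd is really the response, the terms of interest are those involving chd; the remaining terms only reproduce the margins of the two history factors.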

13.5 Updating and Selecting Models

There are a number of facilities to update models. The update function takes a result of a previous fit and changes the model in some way. add1 and drop1 show the (approximate) effects of adding and dropping single terms, and step runs a fairly general stepwise fitting procedure. (Note that S-Plus 3.x has a separate stepwise function for multiple regression.)
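The following is a minimal sketch of these facilities, applied to the coronary heart disease counts of the previous section. The object names (chd.df, fit.main, fit.sc, fit.step) and the choice of starting model are illustrative assumptions, and the exact form of the scope argument may differ between versions.

```r
# Counts from the contingency table of section 13.4
# (blood pressure varies fastest, then serum cholesterol, then chd).
counts <- c(  2,   3,  3,  4,    3,  2,  1,  3,
              8,  11,  6,  6,    7, 12, 11, 11,
            117, 121, 47, 22,   85, 98, 43, 20,
            119, 209, 68, 43,   67, 99, 46, 33)
chd.df <- expand.grid(press = factor(1:4), serum = factor(1:4),
                      chd = factor(c("yes", "no")))
chd.df$counts <- counts

# Start from the main-effects model.
fit.main <- glm(counts ~ press + serum + chd, family = poisson, data = chd.df)

# update: refit with one extra term, keeping everything else unchanged.
fit.sc <- update(fit.main, . ~ . + serum:chd)

# add1: effect of adding each two-factor interaction, one at a time.
add1(fit.main, ~ .^2)

# drop1: effect of dropping each term that can be dropped from fit.sc.
drop1(fit.sc)

# step: stepwise search, allowing terms up to two-factor interactions.
fit.step <- step(fit.main, ~ .^2)
```

Here update saves retyping the whole call, while add1 and drop1 give a quick view of which single change would matter most before committing to a refit.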

14 Multivariate Analysis

S-Plus is particularly rich in functions for exploratory multivariate analysis, such as ��������� , ����� and ������� . There are also functions for classical multivariate analysis.

Clustering

The workhorses here are dist, which computes distance matrices (also used in cmdscale), and hclust, which computes a cluster tree by single-, average- or complete linkage.

dist            Distance matrix calculations
hclust          Hierarchical clustering
cutree          Create groups from a cluster tree
plclust         Plot a cluster tree
labclust        Label a cluster tree plot
clorder         Re-order leaves of a cluster tree
subtree         Extract part of a cluster tree
mclust          "model-based" clustering
mclass, mreloc  auxiliary functions
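A minimal sketch of the usual sequence, from data matrix to distances to cluster tree to groups, is given below. The six points are made up for illustration, and the method name "average" is an assumption about the spelling accepted by your version (the names for single and complete linkage vary between versions).

```r
# Six illustrative points in the plane: two well-separated clumps.
x <- matrix(c(0, 1, 0, 4, 5, 4,
              0, 0, 1, 4, 4, 5), ncol = 2)

# Euclidean distances between the rows of x.
d <- dist(x)

# Cluster tree by average linkage.
tree <- hclust(d, method = "average")

# Cut the tree to give two groups; the result is a vector of group numbers.
groups <- cutree(tree, k = 2)
groups
```

The tree itself would normally be inspected with the plotting functions listed above before deciding where to cut it.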

Graphical Methods

This is a varied collection of functions for displaying multivariate data.

cmdscale        Classical multi-dimensional scaling
faces           Chernoff’s faces
mstree          Minimal spanning tree
stars           Star plots
biplot          Biplot (v 3.2)
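As a hedged illustration of classical multi-dimensional scaling, the sketch below recovers a two-dimensional configuration from a distance matrix. The data are made up, and it is assumed that cmdscale returns a two-column matrix of coordinates by default.

```r
# Six illustrative points in the plane.
x <- matrix(c(0, 1, 0, 4, 5, 4,
              0, 0, 1, 4, 4, 5), ncol = 2)

# Distances between the points.
d <- dist(x)

# Classical scaling: a configuration whose distances approximate d.
coords <- cmdscale(d)

# Compare the distances in the configuration with the original ones.
max(abs(as.vector(dist(coords)) - as.vector(d)))
```

Because the original points already lie in two dimensions, the configuration reproduces the distances up to rounding error, although it may be rotated or reflected relative to x.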

Two analyses of socio-economic data on Swiss cantons:#�$&%�'�()'�*"+,'�$.-�#�/)*�0132�451�$�6879+.68: $�6�69;=< 0<>2�4@?BA 1�6�?�(C#�/�+D1�0?�EF2�4G<IH@JKE&LNMO?CP32�4G<QHRJSP�L/�T�6�?K-�#�U�79+&?�E�JV?�P9JV7�*�-�/�W�XBY�X�0[Z@\�'�U]A^#�$.%�'�(8'�*9+_' $&-�#�/8* 07�/8<�7"+.?�E JO?�P9JO6�/CT�+&?�E�0�0

`a2�4[`�?�#)b 6)7"+D1�0-�#�?�#)b 6)7"+S`�0?]b�7�'�/�/�+c`IJVd�0-�#�?�#)b 6)7"+.?C#�U�'�1�/8'9+S`QJe?]b�7�'�/�/�+c`IJVd�0�0�0fZR'�/�4CU�'�1�/)'@7�'�/�/g$.Y�7�UF7�`�'�/�/5h�'�U)b�-�6

Matrix Methods

The classical methods based on variance-covariance matrices.

mahalanobis     Mahalanobis distances
cancor          Canonical correlation analysis
discr           Discriminant analysis
prcomp          Principal components analysis
princomp        Principal components analysis (v 3.2)
� ����� ������  Principal components analysis (v 3.2)
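The sketch below shows two of these functions on a small made-up data matrix. The names centre and S are illustrative, and the component names of the prcomp result (here sdev) are an assumption that should be checked against the help file for your version.

```r
# Five illustrative observations on two variables.
x <- matrix(c(2, 4, 6, 8, 10,
              1, 3, 2, 5, 4), ncol = 2)

# Sample mean vector and variance-covariance matrix.
centre <- apply(x, 2, mean)
S <- var(x)

# Squared Mahalanobis distance of each observation from the mean.
d2 <- mahalanobis(x, centre, S)
d2

# Principal components; the squared standard deviations of the components
# add up to the total variance, the trace of S.
pc <- prcomp(x)
sum(pc$sdev^2)
sum(diag(S))
```

A useful check is that the squared Mahalanobis distances computed with the sample covariance matrix always sum to (n - 1)p, here 4 x 2 = 8.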

An example of discriminant analysis with Fisher’s iris data:��� ���� �����"!�#$�%���� ��& ��������'�(�(*)*+,(-�������'�(�(/.�+0(-��� ���'�((21+ 3��� 4����4��5!#"��4�� & )76810(9��4�� &;:�< (81�33��� ���� � ���=!�# � ������ & ��������� ������(>1 3��� ���� � �?!#@��� ��A�B�����DCFE�C@��� ���� � ����G��������IH � �J �"� ��������K�L�� ����M��������������4�����%N��J� & �����J ��& ��� ���A� � �O(-�J��4����4��33��� ���� P?!#Q�������A� � �R'%(S)�+UTV��� ���� W@!#@��� ��A� � �X'�(2.�+��� ����2����Y!�#Y� & ��4�� &�Z � Z ( :< 3[(-�F4�� &�Z � Z ( :�< 3[(\��4�� &KZ � Z ( :< 33������� & ��� ����]PO(-�������� W^(_��W��4�` Z Z (aP�������` Z*� �����J� � ������� �*�b�������=�������%������4 Z (

W�������` Z �%4��� �=� ���������*�b�� ���=� �����%��� �4 Z 3��4�P� & ��� ����]PO(-�������� W^(c��������2�����O(d�4�P�` < �8e 3

[Plot: second discriminant variable (vertical axis) against first discriminant variable (horizontal axis), with each point labelled by its species group.]

Figure 7: Discriminant analysis

A Libraries

Libraries are a mechanism to add ‘packages’ of extra objects (functions and datasets) to S. To find out which libraries are available type

    library()

which on one of my systems gave:�������������������� "!���#�$��%�&��!('�)*�('�+*',���,'.-����/�0�/$����(�*��-)1'�)�243

5�67��*8�9�: ;�<18�6=?>�65�7�<18�@��18�9�:

#.�),��� =�A���#%$��%�&��!B$,�?��'.�*C����DC,'�$*��!('E�*C($��GFH��!JI��K*'0F,L*����! =�A���#%$��%�&��!D'.��C/�&-*M,��#%$1!���),�.FN�����D:*���/5(O1'E�, �A�'� *�PI��K�$1�%)���'&� Q*'.�*C����(�%K�$*��)���'��SRG��'%)� 1�T��&-�M���#�$1!JI�UFV'� *� >���!.L���'%2W�UFV'� 1��!PIFH'EL�! >���!.L���'%2/����FV'.L�!?���.$��(L,)��M���#�$����&��!PIL,)�� C�)1'%� 5�C�)1'%�"��K*'0F,L*���X��),�EF"@�)�� �)1'0F�FH�%)ZYU!B[1'E��A�'��ZIL,)�� *��K1'\F 6�K1'0F,L*�,��!���),�.FN@�),�� �)*'0F�FV��)]YU!B[1'.��A�'&�ZI!��\FH'E�,$���#�! =�A���#%$��%�&��!B��)�.FS#.��'.L$1��)_^�^`���a������:*���N5?O*'.�, �A�'% 1�PI

)��0L*�,��2 ;bIc>4Id<���L����%2ZYU!X$1��'�#.�e���, f��A���#�$����&��!

=���)�F���)1�N����,��)�FH'%$��%�&�g�&����'�#.�N�*��-,)*'�)�2"!���#�$��%�&�W!�����$����D<�6�h�>�[�6(�������g�����'�#.�N�*��-)1'�)�2"!���#�$����&�aC1�.)*��#%$,��)�2/��)i���i5�j%@�O�k�5?)�A���3

�1��-)1'%)�2PRl������LNm"n�!���#�$�������o.��'0FV��pfT

O��0-,)*'�)�2N!���#%$��%�&��!X��),�EF"q1�E��'E-*����!�r(<���L*�,��2sR�^Et�t�u�Tv [,��C��%)��/h�L�L*�*���&C(5�$*'�$���!�$���#�!B���E$��i5�j%@,��A�!wY

[�h5�5 FV',���a�1��-)1'%)�2������$ ���.A,)*'��x����$��,��)�y1!!.L�'�$���'�� !EL�'�$���'&�a!�$*'�$���!�$���#�!

To find out more about a section, use

    library(help=name)

e.g.pf�1�0-,)1'%)�2JR~���&�&L�m�),��-�A�!�$�T��A���#%$�������!����)()�&-�A�!�$N!�$1'%$���!�$���#�!

8��%<PR�2�T ���$1��)*j���A�'�)�$��%���B)1'.� 1���A�-��%)PR�2]�ly/m�^HI��1T Q�A�-���)g����#�'�$����&�g���.$��/[�h�>"!�#�'��,���A�-��%)1!VR�2Z�lygm�^VI��P��F�Ab��!T Q�A�-���)DL,),��L*�,!�'��f�������E$��DF,A"y��*�����4��!�y����������,)*�� JR�KZ��2Z�lyNm�^HI��1T Q�A�-���)f),�&-�A�!%$()1�% �)1��!�!���&�

C�'%$1'�!���$*!

#.���0F #���L�L��%)W���g���*�����\FH��'&�?�,���&A)'.-�-��%2 �V��#�y1�&�"���i!�21�E�V�E$1��)��#�y

� � ��� ����������� � �������������� ������������ � ����!"���#�%$&�������"�('��������*),+�-�.�/�),+�0�1

To use the library, invoke it by

    library(name)

which attaches it as a data directory at the end of the search list. Thus libraries cannot over-ride standard functions nor your own functions. To make a library over-ride the system functions, use

    library(name, first=T)

which attaches it at position 2 (after the .Data directory).

A.1 Library ripley

This is a collection of useful functions and datasets for teaching at Oxford.

boxcox(x, y)      Box-Cox plot for transformations.
!�SMUT��9� � PWV�NPX��,Y �ZPHK"R  replacement for GLIM [�\�] .
eqscplot          equally-scaled plot function.
stdres(object)    calculate standardized residuals from a fit.
studres(object)   calculate Studentized residuals from a fit.

Datasets in the library are:

accdeaths         US accidental deaths 1973-8
cement            dataset on heat evolved in setting cements
cpus              dataset on performance of cpus
deaths            time series on UK lung deaths 1974-9 from Diggle
mdeaths, fdeaths  as above, for males and females
gehan             remission times on leukaemia patients (censored)
forbes            Forbes’ dataset on boiling points, from Atkinson
hills             dataset on times of Scottish hill races
leuk              (uncensored) survival times on leukaemia patients
lh                time series on luteinizing hormone from Diggle
mammals           body weight (kg) and brain weight (g) of mammals, from Weisberg
mcycle            motorcycle impact data – Silverman JRSS B 1985
� ��V9������V�V��  accelerated life testing on motorettes
nottem            time-series of temperatures in Nottingham, 1920-1939
road              dataset on road deaths in the US
rock              dataset on relating permeability to physical measurements
rubber            dataset on rubber wear
ships             ship damage incidents, from McCullagh & Nelder
trees             Black Cherry trees heights, diameters and volumes

A.2 Sources of Libraries

Many S users have generously collected their functions and datasets into libraries and made them publicly available. An archive of libraries is maintained at Carnegie-Mellon as a service to the statistical profession by Mike Meyer. To obtain details of its contents by e-mail send a message to

��������������� �������������� �����������

with body��������� ��������������� ������! �"�# �%$

Ftp to ���&��� �����&�'����&�(�)�)� with user � ��������� is also available.