R News: The Newsletter of the R Project, Volume 4/1, June 2004

Editorial
by Thomas Lumley

R has been well accepted in the academic world for some time, but this issue of R News begins with a description of the use of R in a business setting. Marc Schwartz describes how his consulting company, MedAnalytics, came to use R instead of commercial statistical software, and has since provided both code and financial support to the R Project.

The growth of the R community was also evident at the very successful useR! 2004, the first conference for R users. While R aims to blur the distinctions between users and programmers, useR! was definitely different in aims and audience from the previous DSC conferences held in 1999, 2001, and 2003. John Fox provides a summary of the conference in this issue for those who could not make it to Vienna.

A free registration to useR! was the prize in the competition for a new graphic for http://www.r-project.org. You will have seen the winning graphic, from Eric Lecoutre, on the way to downloading this newsletter. The graphic is unedited R output (click on it to see the code), showing the power of R for creating complex graphs that are completely reproducible. I would also encourage R users to visit the Journal of Statistical Software (http://www.jstatsoft.org), which publishes peer-reviewed code and documentation. R is heavily represented in recent issues of the journal.

The success of R in penetrating so many statistical niches is due in large part to the wide variety of packages produced by the R community. We have articles describing contributed packages for principal component analysis (ade4, used in creating Eric Lecoutre's winning graphic) and statistical process control (qcc), and the recommended survival package. Jianhua Zhang and Robert Gentleman describe tools from the Bioconductor project that make it easier to explore the resulting variety of code and documentation.
The Programmers' Niche and Help Desk columns both come from guest contributors this time. Gabor Grothendieck, known as a frequent poster of answers on the mailing list r-help, has been invited to contribute to the R Help Desk. He presents "Date and Time Classes in R". For the Programmers' Niche, I have written a simple example of classes and methods using the old and new class systems in R. This example arose from a discussion at an R programming course.

Programmers will also find useful Doug Bates' article on least squares calculations. Prompted by discussions on the r-devel list, he describes why the simple matrix formulae from so many textbooks are not the best approach either for speed or for accuracy.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
thomas.lumley@R-project.org

Contents of this issue:

Editorial
The Decision To Use R
The ade4 package - I: One-table methods
qcc: An R package for quality control charting and statistical process control
Least Squares Calculations in R
Tools for interactively exploring R packages
The survival package
useR! 2004
R Help Desk
Programmers' Niche
Changes in R
Changes on CRAN
R Foundation News


The Decision To Use R
A Consulting Business Perspective

by Marc Schwartz

Preface

The use of R has grown markedly during the past few years, which is a testament to the dedication and commitment of R Core, to the contributions of the growing useR base, and to the variety of disciplines in which R is being used. The recent useR! meeting in Vienna was another strong indicator of growth and was an outstanding success.

In addition to its use in academic areas, R is increasingly being used in non-academic environments, such as government agencies and various businesses ranging from small firms to large multinational corporations, in a variety of industries.

In a sense, this diversification in the user base is a visible demonstration of the greater comfort that business-based decision makers have regarding the use of Open-Source operating systems, such as Linux, and applications, such as R. Whether these tools and associated support programs are purchased (such as commercial Linux distributions) or are obtained through online collaborative communities, businesses are increasingly recognizing the value of Open-Source software.

In this article, I will describe the use of R (and other Open-Source tools) in a healthcare-based consulting firm, and the value realized by us and, importantly, by our clients through the use of these tools in our company.

Some Background on the Company

MedAnalytics is a healthcare consulting firm located in a suburb of Minneapolis, Minnesota. The company was founded and legally incorporated in August of 2000 to provide a variety of analytic and related services to two principal constituencies. The first are healthcare providers (hospitals and physicians) engaged in quality improvement programs. These clients will typically have existing clinical (as opposed to administrative) databases and wish to use these to better understand the interactions between patient characteristics, clinical processes and outcomes. The second are drug and medical device companies engaged in late-phase and post-market clinical studies. For these clients, MedAnalytics provides services related to protocol and case report form design and development, interim and final analyses, and independent data safety monitoring board appointments. In these cases, MedAnalytics will typically partner with healthcare technology and service companies to develop web-based electronic data capture, study management and reporting. We have worked in many medical specialties, including cardiac surgery, cardiology, ophthalmology, orthopaedics, gastroenterology and oncology.

Prior to founding MedAnalytics, I had 5 years of clinical experience in cardiac surgery and cardiology, and over 12 years with a medical database software company, where I was principally responsible for the management and analysis of several large multi-site national and regional sub-specialty databases that were sponsored by professional medical societies. My duties included contributing to core dataset design, the generation of periodic aggregate and comparative analyses, the development of multi-variable risk models and contributing to peer-reviewed papers. In that position, the majority of my analyses were performed using SAS (Base, Stat and Graph), though S-PLUS (then from StatSci) was evaluated in the mid-90's and used for a period of time with some smaller datasets.

Evaluating The Tools of the Trade - A Value Based Equation

The founder of a small business must determine how he or she is going to finance initial purchases (of which there are many) and the ongoing operating costs of the company until such time as revenues equal (and we hope ultimately exceed) those costs. If financial decisions are not made wisely, a company can find itself rapidly "burning cash" and risking its viability. (It is for this reason that most small businesses fail within the first two or three years.)

Because there are market-driven thresholds of what prospective clients are willing to pay for services, the cost of the tools used to provide client services must be carefully considered. If your costs are too high, you will not recover them via client billings and (despite attitudes that prevailed during the dotcom bubble) you can't make up per-item losses simply by increasing business volume.

An important part of my company's infrastructure would be the applications software that I used. However, it would be foolish for me to base this decision solely on cost. I must consider the "value", which takes into account four key characteristics: cost, time, quality and client service. Like any business, mine would strive to minimize cost and time, while maximizing quality and client service. However, there is a constant tension among these factors, and the key would be to find a reasonable balance among them, while meeting (and we hope exceeding) the needs and expectations of our clients.

First and foremost, a business must present itself and act in a responsible and professional manner and meet legal and regulatory obligations. There would be no value in the software if my clients and I could not trust the results I obtained.

R News ISSN 1609-3631

In the case of professional tools, such as software, being used internally, one typically has many alternatives to consider. You must first be sure that the potential choices are validated against a list of required functional and quality benchmarks driven by the needs of your prospective clients. You must, of course, be sure that the choices fit within your operating budget. In the specific case of analytic software applications, the list to consider and the price points were quite varied. At the high end of the price range are applications like SAS and SPSS, which have become prohibitively expensive for small businesses that require commercial licenses. Despite my past lengthy experience with SAS, it would not have been financially practical to consider using it. A single-user desktop license for the three key components (Base, Stat and Graph) would have cost about (US)$5,000 per year, well beyond my budget.

Over a period of several months through late 2000, I evaluated demonstration versions of several statistical applications. At the time I was using Windows XP, and thus considered Windows versions of any applications. I narrowed the list to S-PLUS and Stata. Because of my prior experience with both SAS and S-PLUS, I was comfortable with command line applications, as opposed to those with menu-driven "point and click" GUIs. Both S-PLUS and Stata would meet my functional requirements regarding analytic methods and graphics. Because both provided for programming within the language, I could reasonably expect to introduce some productivity enhancements and reduce my costs by more efficient use of my time.

"An Introduction To R"

About then (late 2000), I became aware of R through contacts with other users and through posts to various online forums. I had not yet decided on S-PLUS or Stata, so I downloaded and installed R to gain some experience with it. This was circa R version 1.2.x. I began to experiment with R, reading the available online documentation, perusing posts to the r-help list and using some of the books on S-PLUS that I had previously purchased, most notably the first edition (1994) of Venables and Ripley's MASS.

Although I was not yet specifically aware of the R Community and the general culture of the GPL and Open-Source application development, the basic premise and philosophy was not foreign to me. I was quite comfortable with online support mechanisms, having used Usenet, online "knowledge bases" and other online community forums in the past. I had found that, more often than not, these provided more competent and expedient support than did the telephone support from a vendor. I also began to realize that the key factor in the success of my business would be superior client service, and not expensive proprietary technology. My experience and my needs helped me accept the basic notion of Open-Source development and online community-based support.

From my earliest exposure to R, it was evident to me that this application had evolved substantially in a relatively short period of time (as it has continued to do) under the leadership of some of the best minds in the business. It also became rapidly evident that it would fit my functional requirements, and my comfort level with it increased substantially over a period of months. Of course, my prior experience with S-PLUS certainly helped me to learn R rapidly (although there are important differences between S-PLUS and R, as described in the R FAQ).

I therefore postponed any purchase decisions regarding S-PLUS or Stata, and made a commitment to using R.

I began to develop and implement various functions that I would need in providing certain analytic and reporting capabilities for clients. Over a period of several weeks, I created a package of functions for internal use that substantially reduced the time it takes to import and structure data from certain clients (typically provided as Excel workbooks or as ASCII files) and then generate a large number of tables and graphs. These types of reports typically use standardized datasets and are created reasonably frequently. Moreover, I could easily enhance and introduce new components to the reports as required by the changing needs of these clients. Over time, I introduced further efficiencies and was able to take a process that initially had taken several days quite literally down to a few hours. As a result, I was able to spend less time on basic data manipulation and more time on the review, interpretation and description of key findings.

In this way, I could reduce the total time required for a standardized project, reducing my costs and also increasing the value to the client by being able to spend more time on the higher-value services to the client, instead of on internal processes. By enhancing the quality of the product and ultimately passing my cost reductions on to the client, I am able to increase substantially the value of my services.

A Continued Evolution

In my more than three years of using R, I have continued to implement other internal process changes. Most notably, I have moved day-to-day computing operations from Windows XP to Linux. Needless to say, R's support of multiple operating systems made this transition essentially transparent for the application software. I also started using ESS as my primary working environment for R and enjoyed its higher level of support for R compared to the editor that I had used under Windows, which simply provided syntax highlighting.

My transformation to Linux occurred gradually during the later Red Hat (RH) 8.0 beta releases, as I became more comfortable with Linux and with the process of replacing Windows functionality that had become "second nature" over my years of MS operating system use (dating back to the original IBM PC/DOS days of 1981). For example, I needed to replace my practice in R of generating WMF files to be incorporated into Word and Powerpoint documents. At first, I switched to generating EPS files for incorporation into OpenOffice.org's (OO.org) Writer and Impress documents, but now I primarily use LaTeX and the "seminar" slide creation package, frequently in conjunction with Sweave. Rather than using Powerpoint or Impress for presentations, I create landscape oriented PDF files and display them with Adobe Acrobat Reader, which supports a full screen presentation mode.

I also use OO.org's Writer for most documents that I need to exchange with clients. For the final version, I export a PDF file from Writer, knowing that clients will have access to Adobe's Reader. Because OO.org's Writer supports MS Word input formats, it provides for interchange with clients when jointly editable documents (or tracking of changes) are required. When clients send me Excel worksheets, I use OO.org's Calc, which supports Excel data formats (and, in fact, exports much better CSV files from these worksheets than does Excel). I have replaced Outlook with Ximian's Evolution for my e-mail, contact list and calendar management, including the coordination of electronic meeting requests with business partners and clients and, of course, synching my Sony Clie (which Sony recently announced will soon be deprecated in favor of smarter cell phones).

With respect to Linux, I am now using Fedora Core (FC) 2, having moved from RH 8.0 to RH 9 to FC 1. I have, in time, replaced all Windows-based applications with Linux-based alternatives, with one exception, and that is some accounting software that I need in order to exchange electronic files with my accountant. In time, I suspect that this final hurdle will be removed and, with it, the need to use Windows at all. I should make it clear that this is not a goal for philosophic reasons, but for business reasons. I have found that the stability and functionality of the Linux environment has enabled me to engage in a variety of technical and business process related activities that would either be problematic or prohibitively expensive under Windows. This enables me to realize additional productivity gains that further reduce operating costs.

Giving Back

In closing, I would like to make a few comments about the R Community.

I have felt fortunate to be a useR, and I also have felt strongly that I should give back to the Community - recognizing that, ultimately, through MedAnalytics and the use of R, I am making a living using software that has been made available through voluntary efforts and which I did not need to purchase.

To that end, over the past three years, I have committed to contributing and giving back to the Community in several ways. First and foremost, by responding to queries on the r-help list and, later, on the r-devel list. There has been, and will continue to be, a continuous stream of new useRs who need basic assistance. As current useRs progress in experience, each of us can contribute to the Community by assisting new useRs. In this way, a cycle develops in which the students evolve into teachers.

Secondly, in keeping with the philosophy that "useRs become developers", as I have found the need for particular features and have developed specific solutions, I have made them available via CRAN or the lists. Two functions in particular have been available in the "gregmisc" package since mid-2002, thanks to Greg Warnes. These are the barplot2() and CrossTable() functions. As time permits, I plan to enhance, most notably, the CrossTable() function with a variety of non-parametric correlation measures, and will make those available when ready.

Thirdly, in a different area, I have committed to financially contributing back to the Community by supporting the R Foundation and the recent useR! 2004 meeting. This is distinct from volunteering my time and recognizes that, ultimately, even a voluntary Community such as this one has some operational costs to cover. When, in 2000-2001, I was contemplating the purchase of S-PLUS and Stata, these would have cost approximately (US)$2,500 and (US)$1,200 respectively. In effect, I have committed those dollars to supporting the R Foundation, at the Benefactor level, over the coming years.

I would challenge and urge other commercial useRs to contribute to the R Foundation at whatever level makes sense for your organization.

Thanks

It has been my pleasure and privilege to be a member of this Community over the past three years. It has been nothing short of phenomenal to experience the growth of this Community during that time.

I would like to thank Doug Bates for his kind invitation to contribute this article, the idea for which arose in a discussion we had during the recent useR! meeting in Vienna. I would also like to extend my thanks to R Core and to the R Community at large for their continued drive to create and support this wonderful vehicle for international collaboration.

Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
MSchwartz@MedAnalytics.com

The ade4 package - I: One-table methods

by Daniel Chessel, Anne B. Dufour and Jean Thioulouse

Introduction

This paper is a short summary of the main classes defined in the ade4 package for one table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-table coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-way tables), and for graphical methods.

This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number, but means that there are four E in the acronym).

The ade4 package is available in CRAN, but it can also be used directly online, thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).

The duality diagram class

The basic tool in ade4 is the duality diagram (Escoufier, 1987). A duality diagram is simply a list that contains a triplet (X, Q, D):

- X is a table with n rows and p columns, considered as p points in R^n (column vectors) or n points in R^p (row vectors).

- Q is a p x p diagonal matrix containing the weights of the p columns of X, and used as a scalar product in R^p (Q is stored under the form of a vector of length p).

- D is an n x n diagonal matrix containing the weights of the n rows of X, and used as a scalar product in R^n (D is stored under the form of a vector of length n).

For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix I_p and if D is equal to (1/n) I_n, the triplet corresponds to a principal component analysis on the correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see table 1), but more complex methods can also be represented by their duality diagram.
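As a concrete check of this correspondence (a sketch in base R, not code from the ade4 package or from this article), the normed-PCA triplet can be diagonalized directly: the eigenvalues of t(X) %*% D %*% X %*% Q are those of the correlation matrix. The sqrt((n - 1) / n) factor below is only there because scale() divides by n - 1, while the triplet uses uniform row weights 1/n:

```r
# Sketch (base R, not ade4): eigenanalysis of the normed-PCA triplet.
data(USArrests)
n <- nrow(USArrests)
X <- scale(USArrests) * sqrt((n - 1) / n)  # columns normed with 1/n variances
D <- diag(rep(1 / n, n))                   # uniform row weights
Q <- diag(rep(1, ncol(X)))                 # identity column weights
# With these choices, t(X) %*% D %*% X %*% Q is exactly cor(USArrests),
# so its eigenvalues are the normed PCA eigenvalues.
eig <- eigen(t(X) %*% D %*% X %*% Q)$values
```

These are the same four eigenvalues (about 2.48, 0.99, 0.36 and 0.17) that dudi.pca prints for the USArrests example used in the rest of this section.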

Functions   Analyses                                       Notes
dudi.pca    principal component                            1
dudi.coa    correspondence                                 2
dudi.acm    multiple correspondence                        3
dudi.fca    fuzzy correspondence                           4
dudi.mix    analysis of a mixture of numeric and factors   5
dudi.nsc    non symmetric correspondence                   6
dudi.dec    decentered correspondence                      7

Table 1: The dudi functions. 1: Principal component analysis, same as prcomp/princomp. 2: Correspondence analysis, Greenacre (1984). 3: Multiple correspondence analysis, Tenenhaus and Young (1985). 4: Fuzzy correspondence analysis, Chevenet et al. (1994). 5: Analysis of a mixture of numeric variables and factors, Hill and Smith (1976), Kiers (1994). 6: Non symmetric correspondence analysis, Kroonenberg and Lombardo (1999). 7: Decentered correspondence analysis, Dolédec et al. (1995).

The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.

We can use, for example, a well-known dataset from the base package:

> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

scannf = FALSE means that the number of principal components that will be used to compute row and column coordinates should not be asked interactively to the user, but taken as the value of argument nf (by default, nf = 2). Other parameters allow the user to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:


> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values
  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm

pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
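The variance property can be verified numerically (a sketch assuming the ade4 package is installed; the helper wvar below is ours, not part of ade4): since the coordinates are centered with respect to the row weights, the weighted sum of squares of each column of pca1$li reproduces the eigenvalues, while the columns of pca1$l1 have unit variance.

```r
# Sketch, assuming ade4 is installed: check the variance property of
# the row coordinates (li) and normed scores (l1).
library(ade4)
data(USArrests)
pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
# row-weighted variance (coordinates have weighted mean zero)
wvar <- function(x, w) sum(w * x^2)
apply(pca1$li, 2, wvar, w = pca1$lw)  # equals pca1$eig[1:3]
apply(pca1$l1, 2, wvar, w = pca1$lw)  # equals 1 for each axis
```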

The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:

> score(pca1)
> s.corcircle(pca1$co)

Figure 1: One dimensional canonical graph for a normed PCA. Variables (Murder, Assault, UrbanPop, Rape) are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.

Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.

The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):

> scatter(pca1)


Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed to the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.

Separate factor maps can be drawn with the s.corcircle (see figure 2) and s.label functions:

> s.label(pca1$li)

Figure 4: Individuals factor map in a normed PCA.

Distance matrices

A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre, 1986), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971), cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).

The Yanomama data set (Manly, 1991) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)) that gives a Euclidean representation of the 19 villages. This Euclidean representation allows us to compare the geographical, genetic and anthropometric distances.

> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")

Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.

R News ISSN 1609-3631

Vol. 4/1, June 2004

Taking into account groups of individuals

In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):

> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+                 scannf = FALSE)
> plot(bet1)

The meaudret$fau data frame is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore this is a between-sites analysis, whose aim is to discriminate the sites given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.

Figure 6: BCA plot. This is a composite plot made of: 1) the species canonical weights (top left), 2) the species scores (middle left), 3) the eigenvalues bar chart (bottom left), 4) the plot of plain CA axes projected into BCA (bottom centre), 5) the gravity centers of classes (bottom right), 6) the projection of the rows with ellipses and gravity centers of classes (main graph).

Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)^-, Dw), where (XtDX)^- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^-1 Yt DX, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).

Permutation tests

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia and randtest.discrimin, together with RV.rtest; several of these tests are available both in the R (mantel.rtest) and in the C (mantel.randtest) programming language. The R


version makes it possible to see how the computations are performed and to write other tests easily, while the C version is needed for performance reasons.

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")

Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line at the right of the histogram.

Conclusion

We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).

We are preparing a second paper dealing with two-table coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre, 1998). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al., 2003).

The third category of data analysis methods available in ade4 are K-tables analysis methods, which try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al., 1994) (statis and pta functions) or from the multiple co-inertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) makes it possible to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée : application aux peuplements ichtyologiques du Haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in Numerical Ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.


Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical Ecology. Elsevier Science BV, Amsterdam, 2nd English edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo Methods in Biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment makes it possible to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach with a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Because of the intended final users, a free statistical environment was important, and R was a natural candidate.

Although it was written from scratch, the initial design reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. It can be created by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample, or "rational group", must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25, ], type = "xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)

Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


Control charts for variables:

- type "xbar" (X chart): sample means are plotted to control the mean level of a continuous process variable.

- type "xbar.one" (X chart): sample values from a one-at-a-time data process are plotted to control the mean level of a continuous process variable.

- type "R" (R chart): sample ranges are plotted to control the variability of a continuous process variable.

- type "S" (S chart): sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes:

- type "p" (p chart): the proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

- type "np" (np chart): the number of nonconforming units is plotted. Control limits are based on the binomial distribution.

- type "c" (c chart): the number of defectives per unit is plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.

- type "u" (u chart): the average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find and possibly eliminate such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (the so-called "American practice") or by specifying the confidence level (the so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tail probability of 0.0027 under a standard normal curve.
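This equivalence is easy to check numerically with the standard normal functions in base R (a quick sketch, independent of the qcc package):

```r
# Two-tail probability outside +/- 3 sigma under a standard normal:
2 * pnorm(-3)
# [1] 0.002699796

# Conversely, the number of sigmas matching that confidence level:
qnorm(1 - 0.0027 / 2)   # approximately 3
```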

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user, through the arguments center, std.dev and limits, respectively.
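For instance, if the process parameters were known from long-run historical data, the chart could be built around them rather than around estimates (a sketch; the numeric values below are purely illustrative, not taken from the pistonrings data):

```r
> obj.known <- qcc(diameter[1:25, ], type = "xbar",
+                  center = 74, std.dev = 0.01,
+                  limits = c(73.97, 74.03))
```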

Figure 1: An X chart for the samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
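For example, the summary panel can be suppressed when redrawing the chart:

```r
> plot(obj, add.stats = FALSE)
```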

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25, ], type = "xbar",
+            newdata = diameter[26:40, ])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

Figure 2: An X chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all = FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method; we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
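As a sketch of the one-at-a-time case, we might chart only the first observation of each pistonrings sample (illustrative; any vector of individual measurements would do):

```r
> obj.one <- qcc(diameter[1:25, 1], type = "xbar.one")
```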

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes = size[trial], type = "p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow = c(1, 2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE makes it possible to interactively identify values on the plot.

Figure 3: Operating characteristic curves.

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful for detecting small and permanent variation in the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control state; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
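For example, both settings can be overridden when drawing the chart (illustrative values; see help(cusum) for the exact argument list):

```r
> cusum(obj, decision.int = 4, se.shift = 1.5)
```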

Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As with the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
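A different smoothing constant can be tried simply by overriding the default (an illustrative value; larger λ gives more weight to recent samples):

```r
> ewma(obj, lambda = 0.4)
```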


Figure 5: EWMA chart.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc', of type "xbar" or "xbar.one", and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+   spec.limits = c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
  spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
Center = 74.00118            LSL = 73.95
StdDev = 0.009887547         USL = 74.05

Capability indices:
      Value   2.5%  97.5%
Cp    1.686  1.476  1.895
Cp_l  1.725  1.539  1.912
Cp_u  1.646  1.467  1.825
Cp_k  1.646  1.433  1.859
Cpm   1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of the data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).

Figure 6: Process capability analysis.

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart;

• an R or S chart (if sample sizes > 10);

• a run chart;

• a histogram with specification limits;

• a normal Q–Q plot;

• a capability plot.

As an example, we may input the following code:

> process.capability.sixpack(obj,
+   spec.limits = c(73.95, 74.05),
+   target = 74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful for identifying those factors that have the greatest cumulative effect on the system, and thus screening out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num",
+     "part num")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axis labels and limits, the main title and bar colors (see help(pareto.chart)).

Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize the possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause = list(Measurements = c("Microscopes",
+                                 "Inspectors"),
+                Materials = c("Alloys",
+                              "Suppliers"),
+                Personnel = c("Supervisors",
+                              "Operators"),
+                Environment = c("Condensation",
+                                "Moisture"),
+                Methods = c("Brake", "Engager",
+                            "Angle"),
+                Machines = c("Speed", "Bits",
+                             "Sockets")),
+   effect = "Surface Flaws")

Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise it sets the required option. From a user point of view, the most interesting options make it possible to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before a point is signalled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
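For example (a sketch; the option name run.length is inferred from the description above, so check help(qcc.options) for the authoritative list):

```r
> qcc.options()                 # list all options and current values
> qcc.options(run.length = 7)   # signal a violating run after 7 points
```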

Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and may thus be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data) / sum(sizes)
  z <- (data/sizes - pbar) / sqrt(pbar * (1 - pbar) / sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf) / 2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50, 5), rep(100, 5), rep(25, 5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow = c(1, 2))
# plot the control chart with variable limits
> qcc(x, type = "p", sizes = n)
# plot the standardized control chart
> qcc(x, type = "p.std", sizes = n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).

Summary

In this paper we have briefly described the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; the corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000). Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991). Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
[email protected]

Least Squares Calculations in R
Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

R News ISSN 1609-3631

Vol. 4/1, June 2004

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²        (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y        (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
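As a quick sanity check of principles 1 and 2 (a sketch, not from the article; the matrices here are made up), solve(A, b) matches the explicit-inverse computation and crossprod agrees with the t() %*% formulations:

```r
# Sketch illustrating principles 1 and 2; matrices are made up
# for illustration and do not appear in the article.
set.seed(1)
A <- crossprod(matrix(rnorm(100 * 20), 100, 20))  # 20 x 20, positive definite
b <- rnorm(20)

x1 <- solve(A) %*% b   # explicit inverse: slower, less stable
x2 <- solve(A, b)      # solves the system directly
stopifnot(all.equal(as.vector(x1), as.vector(x2)))

X <- matrix(rnorm(100 * 5), 100, 5)
y <- rnorm(100)
stopifnot(all.equal(crossprod(X), t(X) %*% X),
          all.equal(crossprod(X, y), t(X) %*% y))
```

The two forms agree numerically; the timing differences become visible only on larger problems, as the next section shows.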

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sysgc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sysgc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sysgc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sysgc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward

> sysgc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sysgc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed sparse column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric sparse column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS) such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice; once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
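The equivalence of the two forms (though not the relative timings, which depend on matrix sizes and the BLAS in use) is easy to confirm; the matrices below are made up for illustration:

```r
# crossprod(A, B) computes t(A) %*% B without an explicit transpose;
# the matrices here are made up for illustration.
set.seed(123)
A <- matrix(rnorm(300 * 100), 300, 100)
B <- matrix(rnorm(300 * 50), 300, 50)
stopifnot(all.equal(t(A) %*% B, crossprod(A, B)))
```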

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of that subdirectory, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)
## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation",
+      xscale = 365, ylab = "% surviving", yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

       N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1 69       64     64.5   0.00388   0.00823
trt=2 68       64     63.5   0.00394   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments ("Standard" and "New"; % surviving plotted against years since randomisation).

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h₀(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h₀(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)  se(coef)     z       p
edtrt         1.02980     2.800  0.300321  3.43 0.00061
log(bili)     0.95100     2.588  0.097771  9.73 0.00000
log(protime)  2.88544    17.911  1.031908  2.80 0.00520
age           0.03544     1.036  0.008489  4.18 0.00003
platelet     -0.00128     0.999  0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+        ratetable(edtrt = edtrt, bili = bili,
+                  platelet = platelet, age = age,
+                  protime = protime),
+        data = pbc, subset = trt == -9,
+        ratetable = mayomodel, cohort = TRUE),
+       col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards (smoothed estimate of β(t) for edtrt plotted against time).

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
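Assuming the chron package is installed, this internal representation can be checked directly with as.numeric() (a small sketch, not from the article):

```r
library(chron)
# noon on January 2nd, 1970 is 1.5 days after chron's origin
x <- chron(dates = "01/02/70", times = "12:00:00")
stopifnot(as.numeric(x) == 1.5)
```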

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications such as world currency markets where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
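As a small sketch of the Windows-Excel conversion (the serial numbers below are made up for illustration and do not come from the article):

```r
# Made-up Excel-for-Windows serial datetimes (origin 1899-12-30);
# floor(x) drops the fractional (time-of-day) part before conversion.
x <- c(38000.25, 38001.75)
d <- as.Date("1899-12-30") + floor(x)
stopifnot(inherits(d, "Date"), as.numeric(diff(d)) == 1)
```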

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
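For instance (a sketch with made-up dates and values, assuming the zoo package is installed), an irregular series in zoo can be indexed directly by Date objects:

```r
library(zoo)
# an irregular series indexed by Date objects (values made up)
z <- zoo(c(1.2, 0.9, 1.5),
         as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
stopifnot(inherits(index(z), "Date"), length(z) == 3)
```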

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
   as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.

R News ISSN 1609-3631


orig <- chron("01/01/60")
x <- 0:9 # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone and Greenwich Mean Time ("GMT") can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


Table 1: Comparison table. In the idioms below, z is a vector of characters and x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct: format(dp, "%a %b %d, %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date from chron:     as.Date(dates(dc))
  to Date from POSIXct:   as.Date(format(dp))
  to chron from Date:     chron(unclass(dt))
  to chron from POSIXct:  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct from Date:   as.POSIXct(format(dt))
  to POSIXct from chron:  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date from chron:     as.Date(dates(dc))
  to Date from POSIXct:   as.Date(dp)
  to chron from Date:     chron(unclass(dt))
  to chron from POSIXct:  as.chron(dp)
  to POSIXct from Date:   as.POSIXct(format(dt), tz="GMT")
  to POSIXct from chron:  as.POSIXct(dc)

Notes:
[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard and daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that -Inf, Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
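The function can be tried on simulated data; this sketch (not from the article) uses a test that is shifted upward for cases:

```r
set.seed(42)
D <- rep(c(1, 0), each = 50)   # 50 cases, 50 controls
T <- rnorm(100, mean = D)      # higher test values for cases
drawROC(T, D)                  # curve should bow above the diagonal
```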

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
    call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
    xlab="1-Specificity", ylab="Sensitivity",
    main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
    xlab=xlab, ylab=ylab, ...)
  if (null.line)
    abline(0, 1, lty=3)
  if (is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ...,
    digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
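One possible sketch of such a function (my own, not the article's solution) applies the trapezoidal rule to the stored rates:

```r
# trapezoidal-rule area under the curve for the S3 "ROC" objects above
AUC <- function(x) {
  sum(diff(x$mspec) * (head(x$sens, -1) + tail(x$sens, -1)) / 2)
}
```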

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
    test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
      c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
    test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
    test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components ("slots") that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
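The extra checks suggested above could be added along these lines (a hypothetical variant of the validity function, not the article's code):

```r
# stricter validity: lengths agree, rates lie in [0,1], curve is monotone
validROC <- function(object) {
  length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test) &&
    all(object@sens >= 0 & object@sens <= 1) &&
    all(object@mspec >= 0 & object@mspec <= 1) &&
    !is.unsorted(object@sens) && !is.unsorted(object@mspec)
}
```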

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
    test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
      xlab="1-Specificity",
      ylab="Sensitivity",
      main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
      xlab=xlab, ylab=ylab, ...)
    if (null.line)
      abline(0, 1, lty=3)
    if (is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
      digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
      labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[ and [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
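A quick illustration of the new environment accessors (a sketch):

```r
e <- new.env()
e$x <- 1        # same effect as assign("x", 1, envir = e)
e[["y"]] <- 2
e$x + e[["y"]]  # 3
```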

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument, package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  – Renamed push/popviewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  – Added a just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

        pushViewport(viewport(w=.5, h=.5, name="A"))
        grid.rect()
        pushViewport(viewport(w=.5, h=.5, name="B"))
        grid.rect(gp=gpar(col="grey"))
        upViewport(2)
        grid.rect(vp="A", gp=gpar(fill="red"))
        grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a name slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

        "on"      clip to the viewport (was TRUE)
        "inherit" clip to whatever parent says (was FALSE)
        "off"     turn off clipping

    Still accept logical values (and NA maps to "off").
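The viewport-tree navigation added to grid (vpStack(), upViewport(), downViewport()) can be sketched as follows; the viewport names are invented for the example:

```r
# Sketch: pushing a viewport stack and navigating the resulting tree by name.
library(grid)
pdf(tempfile(fileext = ".pdf"))   # any device will do; pdf() keeps this headless

# A stack is pushed in series: after this we are inside "inner".
pushViewport(vpStack(viewport(name = "outer"), viewport(name = "inner")))
n1 <- current.viewport()$name     # "inner"

upViewport(2)                     # back up without popping the viewports
downViewport("inner")             # navigate back down the tree by name
n2 <- current.viewport()$name     # "inner" again

dev.off()
```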

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
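What the fixed RNG kind and seed buy is reproducible example output; a minimal sketch:

```r
# Sketch: with the default RNG kind and a fixed seed, random output
# in examples is reproducible across runs.
RNGkind("default", "default")
set.seed(1)
a <- runif(3)

set.seed(1)
b <- runif(3)

identical(a, b)   # TRUE
```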

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().
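A quick sketch of the replacement function, which returns the DESCRIPTION metadata of an installed package:

```r
# Sketch: packageDescription() reads a package's DESCRIPTION metadata.
d <- packageDescription("stats")
d$Package    # "stats"
d$Priority   # "base"
```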

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
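The psigamma() replacement generalizes the whole family of polygamma functions; a short sketch:

```r
# Sketch: psigamma(x, deriv = n) is the (n+1)-th derivative of log(gamma(x)),
# so deriv = 0 matches digamma() and deriv = 1 matches trigamma();
# deriv = 2 and 3 replace the deprecated tetragamma() and pentagamma().
x <- 2.5
c(psigamma(x, deriv = 0), digamma(x))    # equal
c(psigamma(x, deriv = 1), trigamma(x))   # equal
psigamma(x, deriv = 2)                   # what tetragamma(x) used to return
```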

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zip codes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


The Decision To Use R
A Consulting Business Perspective

by Marc Schwartz

Preface

The use of R has grown markedly during the past few years, which is a testament to the dedication and commitment of R Core, to the contributions of the growing useR base, and to the variety of disciplines in which R is being used. The recent useR! meeting in Vienna was another strong indicator of growth and was an outstanding success.

In addition to its use in academic areas, R is increasingly being used in non-academic environments, such as government agencies and various businesses, ranging from small firms to large multi-national corporations, in a variety of industries.

In a sense, this diversification in the user base is a visible demonstration of the greater comfort that business-based decision makers have regarding the use of Open-Source operating systems, such as Linux, and applications, such as R. Whether these tools and associated support programs are purchased (such as commercial Linux distributions) or are obtained through online collaborative communities, businesses are increasingly recognizing the value of Open-Source software.

In this article, I will describe the use of R (and other Open-Source tools) in a healthcare-based consulting firm, and the value realized by us and, importantly, by our clients through the use of these tools in our company.

Some Background on the Company

MedAnalytics is a healthcare consulting firm located in a suburb of Minneapolis, Minnesota. The company was founded and legally incorporated in August of 2000 to provide a variety of analytic and related services to two principal constituencies. The first are healthcare providers (hospitals and physicians) engaged in quality improvement programs. These clients will typically have existing clinical (as opposed to administrative) databases and wish to use these to better understand the interactions between patient characteristics, clinical processes and outcomes. The second are drug and medical device companies engaged in late-phase and post-market clinical studies. For these clients, MedAnalytics provides services related to protocol and case report form design and development, interim and final analyses, and independent data safety monitoring board appointments. In these cases, MedAnalytics will typically partner with healthcare technology and service companies to develop web-based electronic data capture, study management and reporting. We have worked in many medical specialties, including cardiac surgery, cardiology, ophthalmology, orthopaedics, gastroenterology and oncology.

Prior to founding MedAnalytics, I had 5 years of clinical experience in cardiac surgery and cardiology, and over 12 years with a medical database software company, where I was principally responsible for the management and analysis of several large multi-site national and regional sub-specialty databases that were sponsored by professional medical societies. My duties included contributing to core dataset design, the generation of periodic aggregate and comparative analyses, the development of multi-variable risk models and contributing to peer-reviewed papers. In that position, the majority of my analyses were performed using SAS (Base, Stat and Graph), though S-PLUS (then from StatSci) was evaluated in the mid-1990s and used for a period of time with some smaller datasets.

Evaluating the Tools of the Trade - A Value-Based Equation

The founder of a small business must determine how he or she is going to finance initial purchases (of which there are many) and the ongoing operating costs of the company until such time as revenues equal (and we hope, ultimately exceed) those costs. If financial decisions are not made wisely, a company can find itself rapidly "burning cash" and risking its viability. (It is for this reason that most small businesses fail within the first two or three years.)

Because there are market-driven thresholds of what prospective clients are willing to pay for services, the cost of the tools used to provide client services must be carefully considered. If your costs are too high, you will not recover them via client billings, and (despite attitudes that prevailed during the dot-com bubble) you can't make up per-item losses simply by increasing business volume.

An important part of my company's infrastructure would be the applications software that I used. However, it would be foolish for me to base this decision solely on cost. I must consider the "value", which takes into account four key characteristics: cost, time, quality and client service. Like any business, mine would strive to minimize cost and time, while maximizing quality and client service. However, there is a constant tension among these factors, and the key would be to find a reasonable balance among them, while meeting (and we hope exceeding) the needs and expectations of our clients.

R News	ISSN 1609-3631

Vol. 4/1, June 2004

First and foremost, a business must present itself and act in a responsible and professional manner, and meet legal and regulatory obligations. There would be no value in the software if my clients and I could not trust the results I obtained.

In the case of professional tools, such as software, being used internally, one typically has many alternatives to consider. You must first be sure that the potential choices are validated against a list of required functional and quality benchmarks driven by the needs of your prospective clients. You must, of course, be sure that the choices fit within your operating budget. In the specific case of analytic software applications, the list to consider and the price points were quite varied. At the high end of the price range are applications like SAS and SPSS, which have become prohibitively expensive for small businesses that require commercial licenses. Despite my past lengthy experience with SAS, it would not have been financially practical to consider using it. A single-user desktop license for the three key components (Base, Stat and Graph) would have cost about (US)$5,000 per year, well beyond my budget.

Over a period of several months through late 2000, I evaluated demonstration versions of several statistical applications. At the time I was using Windows XP, and thus considered Windows versions of any applications. I narrowed the list to S-PLUS and Stata. Because of my prior experience with both SAS and S-PLUS, I was comfortable with command line applications, as opposed to those with menu-driven "point and click" GUIs. Both S-PLUS and Stata would meet my functional requirements regarding analytic methods and graphics. Because both provided for programming within the language, I could reasonably expect to introduce some productivity enhancements and reduce my costs by more efficient use of my time.

"An Introduction To R"

About then (late 2000), I became aware of R through contacts with other users and through posts to various online forums. I had not yet decided on S-PLUS or Stata, so I downloaded and installed R to gain some experience with it. This was circa R version 1.2.x. I began to experiment with R, reading the available online documentation, perusing posts to the r-help list and using some of the books on S-PLUS that I had previously purchased, most notably the first edition (1994) of Venables and Ripley's MASS.

Although I was not yet specifically aware of the R Community and the general culture of the GPL and Open Source application development, the basic premise and philosophy was not foreign to me. I was quite comfortable with online support mechanisms, having used Usenet, online "knowledge bases" and other online community forums in the past. I had found that, more often than not, these provided more competent and expedient support than did the telephone support from a vendor. I also began to realize that the key factor in the success of my business would be superior client service, and not expensive proprietary technology. My experience and my needs helped me accept the basic notion of Open Source development and online community-based support.

From my earliest exposure to R, it was evident to me that this application had evolved substantially in a relatively short period of time (as it has continued to do) under the leadership of some of the best minds in the business. It also became rapidly evident that it would fit my functional requirements, and my comfort level with it increased substantially over a period of months. Of course, my prior experience with S-PLUS certainly helped me to learn R rapidly (although there are important differences between S-PLUS and R, as described in the R FAQ).

I therefore postponed any purchase decisions regarding S-PLUS or Stata, and made a commitment to using R.

I began to develop and implement various functions that I would need in providing certain analytic and reporting capabilities for clients. Over a period of several weeks, I created a package of functions for internal use that substantially reduced the time it takes to import and structure data from certain clients (typically provided as Excel workbooks or as ASCII files) and then generate a large number of tables and graphs. These types of reports typically use standardized datasets and are created reasonably frequently. Moreover, I could easily enhance and introduce new components to the reports as required by the changing needs of these clients. Over time, I introduced further efficiencies and was able to take a process that initially had taken several days quite literally down to a few hours. As a result, I was able to spend less time on basic data manipulation and more time on the review, interpretation and description of key findings.
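The article does not show this internal code, but a minimal sketch of such a standardized import-and-report step might look like the following (all function, column and file names here are hypothetical illustrations, not MedAnalytics' actual package):

```r
## Hypothetical sketch of a standardized client-data import helper,
## in the spirit described above. Assumes the client data arrive as
## a CSV export of an Excel workbook or as an ASCII file.
import.client.data <- function(file) {
  dat <- read.csv(file, stringsAsFactors = TRUE)
  names(dat) <- tolower(names(dat))   # enforce consistent column names
  dat
}

## One frequency table per factor column, split by a grouping
## variable: a building block for standardized report tables.
report.tables <- function(dat, by) {
  lapply(dat[sapply(dat, is.factor)],
         function(x) table(x, dat[[by]]))
}
```

Wrapping such helpers, together with standard graphics calls, into a package is what allows a multi-day reporting process to be compressed to hours.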

In this way, I could reduce the total time required for a standardized project, reducing my costs, and also increasing the value to the client by being able to spend more time on the higher-value services to the client, instead of on internal processes. By enhancing the quality of the product and ultimately passing my cost reductions on to the client, I am able to increase substantially the value of my services.

A Continued Evolution

In my more than three years of using R, I have continued to implement other internal process changes. Most notably, I have moved day-to-day computing operations from Windows XP to Linux. Needless to say, R's support of multiple operating systems made this transition essentially transparent for the application software. I also started using ESS as my primary working environment for R and enjoyed its higher level of support for R compared to the editor that I had used under Windows, which simply provided syntax highlighting.

My transformation to Linux occurred gradually during the later Red Hat (RH) 8.0 beta releases, as I became more comfortable with Linux and with the process of replacing Windows functionality that had become "second nature" over my years of MS operating system use (dating back to the original IBM PC/DOS days of 1981). For example, I needed to replace my practice in R of generating WMF files to be incorporated into Word and PowerPoint documents. At first, I switched to generating EPS files for incorporation into OpenOffice.org's (OO.org) Writer and Impress documents, but now I primarily use LaTeX and the "seminar" slide creation package, frequently in conjunction with Sweave. Rather than using PowerPoint or Impress for presentations, I create landscape-oriented PDF files and display them with Adobe Acrobat Reader, which supports a full-screen presentation mode.
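As an illustration of the Sweave-plus-seminar workflow mentioned above, a minimal slide source might look like this (a hypothetical example, not taken from the article; process the file with Sweave() in R and then run LaTeX on the result):

```latex
% slides.Rnw -- minimal Sweave + seminar sketch
\documentclass{seminar}
\begin{document}
\begin{slide}
\begin{center}\textbf{A figure generated by R}\end{center}
% The code chunk below is replaced by its graphical output
% when the file is processed with Sweave().
<<fig=TRUE, echo=FALSE>>=
boxplot(count ~ spray, data = InsectSprays)
@
\end{slide}
\end{document}
```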

I also use OO.org's Writer for most documents that I need to exchange with clients. For the final version, I export a PDF file from Writer, knowing that clients will have access to Adobe's Reader. Because OO.org's Writer supports MS Word input formats, it provides for interchange with clients when jointly editable documents (or tracking of changes) are required. When clients send me Excel worksheets, I use OO.org's Calc, which supports Excel data formats (and, in fact, exports much better CSV files from these worksheets than does Excel). I have replaced Outlook with Ximian's Evolution for my e-mail, contact list and calendar management, including the coordination of electronic meeting requests with business partners and clients and, of course, synching my Sony Clie (which Sony recently announced will soon be deprecated in favor of smarter cell phones).

With respect to Linux, I am now using Fedora Core (FC) 2, having moved from RH 8.0 to RH 9 to FC 1. I have, in time, replaced all Windows-based applications with Linux-based alternatives, with one exception, and that is some accounting software that I need in order to exchange electronic files with my accountant. In time, I suspect that this final hurdle will be removed and, with it, the need to use Windows at all. I should make it clear that this is not a goal for philosophic reasons, but for business reasons. I have found that the stability and functionality of the Linux environment has enabled me to engage in a variety of technical and business process related activities that would either be problematic or prohibitively expensive under Windows. This enables me to realize additional productivity gains that further reduce operating costs.

Giving Back

In closing, I would like to make a few comments about the R Community.

I have felt fortunate to be a useR, and I also have felt strongly that I should give back to the Community, recognizing that, ultimately, through MedAnalytics and the use of R, I am making a living using software that has been made available through voluntary efforts and which I did not need to purchase.

To that end, over the past three years, I have committed to contributing and giving back to the Community in several ways. First and foremost, by responding to queries on the r-help list and, later, on the r-devel list. There has been, and will continue to be, a continuous stream of new useRs who need basic assistance. As current useRs progress in experience, each of us can contribute to the Community by assisting new useRs. In this way, a cycle develops in which the students evolve into teachers.

Secondly, in keeping with the philosophy that "useRs become developers", as I have found the need for particular features and have developed specific solutions, I have made them available via CRAN or the lists. Two functions in particular have been available in the "gregmisc" package since mid-2002, thanks to Greg Warnes. These are the barplot2() and CrossTable() functions. As time permits, I plan to enhance, most notably, the CrossTable() function with a variety of non-parametric correlation measures, and will make those available when ready.

Thirdly, in a different area, I have committed to financially contributing back to the Community by supporting the R Foundation and the recent useR! 2004 meeting. This is distinct from volunteering my time, and recognizes that, ultimately, even a voluntary Community such as this one has some operational costs to cover. When, in 2000-2001, I was contemplating the purchase of S-PLUS and Stata, these would have cost approximately (US)$2,500 and (US)$1,200, respectively. In effect, I have committed those dollars to supporting the R Foundation, at the Benefactor level, over the coming years.

I would challenge and urge other commercial useRs to contribute to the R Foundation at whatever level makes sense for your organization.

Thanks

It has been my pleasure and privilege to be a member of this Community over the past three years. It has been nothing short of phenomenal to experience the growth of this Community during that time.

I would like to thank Doug Bates for his kind invitation to contribute this article, the idea for which arose in a discussion we had during the recent useR! meeting in Vienna. I would also like to extend my thanks to R Core and to the R Community at large for their continued drive to create and support this wonderful vehicle for international collaboration.

Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
[email protected]

The ade4 package - I: One-table methods
by Daniel Chessel, Anne B. Dufour and Jean Thioulouse

Introduction

This paper is a short summary of the main classes defined in the ade4 package for one-table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-tables coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-ways tables), and for graphical methods.

This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number, but means that there are four E in the acronym).

The ade4 package is available in CRAN, but it can also be used directly online, thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).

The duality diagram class

The basic tool in ade4 is the duality diagram (Escoufier, 1987). A duality diagram is simply a list that contains a triplet (X, Q, D):

- X is a table with n rows and p columns, considered as p points in R^n (column vectors) or n points in R^p (row vectors).

- Q is a p x p diagonal matrix containing the weights of the p columns of X, and used as a scalar product in R^p (Q is stored under the form of a vector of length p).

- D is an n x n diagonal matrix containing the weights of the n rows of X, and used as a scalar product in R^n (D is stored under the form of a vector of length n).

For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix I_p, and if D is equal to (1/n) I_n, the triplet corresponds to a principal component analysis on correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see table 1), but more complex methods can also be represented by their duality diagram.
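For reference, the eigen-decompositions associated with a triplet can be written as follows (a standard presentation of the duality diagram, added here for clarity; notation follows the definitions above):

```latex
% Analysis of the triplet (X, Q, D): the principal axes u_k in R^p and
% the principal components v_k in R^n share the same non-zero eigenvalues.
X^{t} D X Q \, u_k = \lambda_k u_k , \qquad
X Q X^{t} D \, v_k = \lambda_k v_k .
% For the normed PCA above (Q = I_p, D = n^{-1} I_n),
% X^{t} D X Q = n^{-1} X^{t} X is the correlation matrix.
```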

Functions    Analyses                          Notes
dudi.pca     principal component               1
dudi.coa     correspondence                    2
dudi.acm     multiple correspondence           3
dudi.fca     fuzzy correspondence              4
dudi.mix     analysis of a mixture of
             numeric variables and factors     5
dudi.nsc     non symmetric correspondence      6
dudi.dec     decentered correspondence         7

Table 1: The dudi functions. 1: Principal component analysis, same as prcomp/princomp. 2: Correspondence analysis (Greenacre, 1984). 3: Multiple correspondence analysis (Tenenhaus and Young, 1985). 4: Fuzzy correspondence analysis (Chevenet et al., 1994). 5: Analysis of a mixture of numeric variables and factors (Hill and Smith, 1976; Kiers, 1994). 6: Non symmetric correspondence analysis (Kroonenberg and Lombardo, 1999). 7: Decentered correspondence analysis (Dolédec et al., 1995).

The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.

We can use, for example, a well-known dataset from the base package:

> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

scannf = FALSE means that the number of principal components that will be used to compute row and column coordinates should not be asked interactively to the user, but taken as the value of argument nf (by default, nf = 2). Other parameters allow to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:


> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values

  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm

pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
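This property is easy to verify numerically (a sketch; it assumes the ade4 package is installed and pca1 was created as above):

```r
## The weighted variance of each column of row coordinates should
## equal the corresponding eigenvalue (the coordinates are centered,
## so the weighted sum of squares is the variance).
library(ade4)
data(USArrests)
pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
apply(pca1$li, 2, function(x) sum(pca1$lw * x^2))
pca1$eig[1:3]   # should match the three values above
```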

The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:

> score(pca1)
> s.corcircle(pca1$co)

[Figure 1 omitted: four panels, one for each of Murder, Assault, UrbanPop and Rape, plotted against the row scores.]

Figure 1: One dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.

[Figure 2 omitted: correlation circle with arrows for Murder, Assault, UrbanPop and Rape.]

Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.

The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):

> scatter(pca1)


[Figure 3 omitted: PCA biplot of the 50 states, with variable arrows superimposed.]

Figure 3: The PCA biplot. Variables are symbolized by arrows and they are superimposed to the individuals display. The scale of the graph is given by a grid, which size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.

Separate factor maps can be drawn with the s.corcircle (see figure 2) and s.label functions:

> s.label(pca1$li)

[Figure 4 omitted: factor map of the 50 states.]

Figure 4: Individuals factor map in a normed PCA.

Distance matrices

A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre, 1986), and make them Euclidean (quasieuclid, lingoes (Lingoes, 1971), cailliez (Cailliez, 1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).

The Yanomama data set (Manly, 1991) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows to compare the geographical, genetic and anthropometric distances.

> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")

[Figure 5 omitted: three PCO scatter plots of the 19 villages.]

Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.


Taking into account groups of individuals

In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):

> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+                 scannf = FALSE)
> plot(bet1)

The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, which aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.

[Figure 6 omitted: composed plot of the between-class correspondence analysis.]

Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).

Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (X^t D X)^-, Dw), where (X^t D X)^- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^{-1} Y^t D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).

Permutation tests

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest) programming language. The R version allows to see how computations are performed, and to write easily other tests, while the C version is needed for performance reasons.

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")

[Figure 7 omitted: histogram of simulated values, with the observed value marked by a vertical line.]

Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.

Conclusion

We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non symmetric correspondence analysis (dudi.nsc), decentered correspondence analysis (dudi.dec).

We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre, 1998). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al., 2003).

The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al. (1994); statis and pta functions) or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305-310.

Cazes, P., Chessel, D., and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39-54.

Chevenet, F., Dolédec, S., and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295-309.

Dolédec, S., Chessel, D., and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du haut-rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29-40.

Dray, S., Chessel, D., and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078-3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139-156. NATO advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325-338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5-48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249-255.

Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197-212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367-396.

Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The act (statis method). Computational Statistics and Data Analysis, 18:97-119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1-24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195-203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L., and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393-3399.

Perrière, G., Lobry, J., and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519-524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91-119.

Thioulouse, J., Chessel, D., Dolédec, S., and Olivier, J. (1997). Ade-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75-83.

Thioulouse, J., Simier, M., and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272-283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: Tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13-26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404-423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): Multivariate analysis of size and shape. Journal of Zoology, 199:421-432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows users to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. It can simply be created by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rationale group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200   3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE

> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40  5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)

Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size:  5
Number of groups:  25
Center of group statistics:  74.00118
Standard deviation:  0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


type      Control charts for variables
xbar      X chart. Sample means are plotted to control the mean
          level of a continuous process variable.
xbar.one  X chart. Sample values from a one-at-a-time data process
          are plotted to control the mean level of a continuous
          process variable.
R         R chart. Sample ranges are plotted to control the
          variability of a continuous process variable.
S         S chart. Sample standard deviations are plotted to
          control the variability of a continuous process variable.

type      Control charts for attributes
p         p chart. The proportion of nonconforming units is
          plotted. Control limits are based on the binomial
          distribution.
np        np chart. The number of nonconforming units is plotted.
          Control limits are based on the binomial distribution.
c         c chart. The number of defectives per unit is plotted.
          This chart assumes that defects of the quality attribute
          are rare, and the control limits are computed based on
          the Poisson distribution.
u         u chart. The average number of defectives per unit is
          plotted. The Poisson distribution is used to compute
          control limits, but unlike the c chart this chart does
          not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or by specifying the confidence level (so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.
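This equivalence can be checked directly in base R; limits at ±3σ leave a two-tail probability of about 0.0027, and the corresponding confidence level of 0.9973 maps back to roughly 3 standard deviations:

```r
## Two-tail probability outside +/- 3 sigma under the standard normal
two.tail <- 2 * pnorm(-3)
## Sigma multiplier implied by a 99.73% confidence level
nsig <- qnorm(1 - (1 - 0.9973) / 2)
two.tail  # approximately 0.0027
nsig      # approximately 3
```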

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.

Figure 1: An X chart for samples data. (xbar chart for diameter[1:25,]; group summary statistics plotted for groups 1-25; number of groups = 25; center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; number beyond limits = 0; number violating runs = 0.)

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can however be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

Figure 2: An X chart with calibration data and new data samples. (Calibration data in diameter[1:25,], new data in diameter[26:40,]; number of groups = 40; center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; number beyond limits = 3; number violating runs = 1.)

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
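For illustration, that estimate can be reproduced by hand on the first few diameter values listed earlier; this is a sketch of the standard moving-range formula, not the package's internal code. The divisor 1.128 is the usual d2 bias-correction constant for ranges of two observations:

```r
## Estimate sigma from moving ranges of k = 2 successive points
x  <- c(74.030, 74.002, 74.019, 73.992, 74.008)
mr <- abs(diff(x))          # moving ranges of successive pairs
sigma.hat <- mean(mr) / 1.128
sigma.hat                   # approximately 0.0195
```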

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE

> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.
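The binomial-based limits of Table 1 can be sketched by hand for data like these. The following is an illustration of the standard formula pbar ± 3·sqrt(pbar(1 − pbar)/n), not qcc's internal code; the counts are the first three samples listed above:

```r
## Hand computation of approximate p chart limits
D    <- c(12, 15, 8)              # nonconforming counts
n    <- 50                        # constant sample size
pbar <- sum(D) / (length(D) * n)  # overall proportion
se   <- sqrt(pbar * (1 - pbar) / n)
limits <- c(LCL = max(0, pbar - 3 * se),
            UCL = min(1, pbar + 3 * se))
limits
```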

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.

Figure 3: Operating characteristic curves. (Left panel: OC curves for the xbar chart, plotting the probability of type II error against the process shift in std.dev units, for sample sizes n = 1, 5, 10, 15, 20. Right panel: OC curves for the p chart, plotting the probability of type II error against p.)

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful to detect small and permanent variation in the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control state; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.

Figure 4: Cusum chart. (Calibration data in diameter[1:25,], new data in diameter[26:40,]; cumulative sums above and below the target plotted against group, with decision boundaries LDB and UDB; number of groups = 40; target = 74.00118; StdDev = 0.009887547; decision boundaries (std. err.) = 5; shift detection (std. err.) = 1; number of points beyond boundaries = 4.)

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.

R News ISSN 1609-3631

Vol 41 June 2004 15

Figure 5: EWMA chart. (Calibration data in diameter[1:25,], new data in diameter[26:40,]; group summary statistics plotted against group; number of groups = 40; target = 74.00118; StdDev = 0.009887547; smoothing parameter = 0.2; control limits at 3-sigma.)

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' of type "xbar" or "xbar.one", and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+                    spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
                   spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
Center = 74.00118            LSL = 73.95
StdDev = 0.009887547         USL = 74.05

Capability indices:

      Value    2.5%   97.5%
Cp    1.686   1.476   1.895
Cp_l  1.725   1.539   1.912
Cp_u  1.646   1.467   1.825
Cp_k  1.646   1.433   1.859
Cpm   1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).

Figure 6: Process capability analysis for diameter[1:25,]. (Histogram of data values over 73.94-74.06 with LSL, target and USL marked; number of obs = 125; center = 74.00118; StdDev = 0.009887547; target = 74; LSL = 73.95; USL = 74.05; Cp = 1.69; Cp_l = 1.73; Cp_u = 1.65; Cp_k = 1.65; Cpm = 1.67; Exp<LSL 0%, Exp>USL 0%, Obs<LSL 0%, Obs>USL 0%.)

A further display, called the "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example, we may input the following code:

> process.capability.sixpack(obj,
+                            spec.limits=c(73.95, 74.05),
+                            target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,


and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).

Figure 7: Pareto chart for defect. (Bars in decreasing order: contact num., price code, supplier code, part num., schedule date; left axis: frequency from 0 to 300; right axis: cumulative percentage from 0 to 100.)

A cause-and-effect diagram, also called an Ishikawa diagram (or fishbone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

Figure 8: Cause-and-effect diagram. (Effect: Surface Flaws; branches: Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle), Machines (Speed, Bits, Sockets).)

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise it sets the required option. From a user's point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before a point is signaled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case); another one which returns the within-group standard deviation (one in this case); and finally a function which returns the control limits. These functions need to be


called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel). (Left panel: 15 groups, center = 0.1931429, StdDev = 0.3947641, LCL and UCL variable. Right panel: center = 0, StdDev = 1, LCL = -3, UCL = 3. In both charts, number beyond limits = 0 and number violating runs = 0.)

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²                        (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y                                (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebraare

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
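The three principles can be compared on a small simulated example (a random, well-conditioned X rather than the large data set used below); all three approaches give the same coefficients:

```r
## Three ways to solve the least squares problem for the same X and y
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
y <- rnorm(200)
b1 <- solve(t(X) %*% X) %*% t(X) %*% y       # naive translation of (2)
b2 <- solve(crossprod(X), crossprod(X, y))   # principles 1 and 2
R  <- chol(crossprod(X))                     # principle 3: X'X = R'R
b3 <- backsolve(R, forwardsolve(R, crossprod(X, y),
                                upper.tri = TRUE, transpose = TRUE))
ok <- isTRUE(all.equal(c(b1), c(b2))) &&
      isTRUE(all.equal(c(b2), c(b3)))
ok
```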

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute:

> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:

> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:

> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object, so that it can be reused without recalculation:

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times:

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or ATLAS (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
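A quick base-R check confirms that the two expressions agree while crossprod avoids forming t(A) explicitly:

```r
## crossprod(A, B) computes t(A) %*% B without materializing t(A)
set.seed(2)
A <- matrix(rnorm(6), 3, 2)
B <- matrix(rnorm(9), 3, 3)
same <- isTRUE(all.equal(crossprod(A, B), t(A) %*% B))
same          # the two results are numerically identical
dim(crossprod(A, B))   # 2 x 3, as expected for t(A) %*% B
```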

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcl/tk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior, if nothing is provided by a user, is to open the first package in the library of locally installed R packages (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
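A minimal session exercising all three widgets might look like this (a sketch; it assumes tkWidgets and its dependencies have been installed from Bioconductor and that a working tcl/tk is available):

```r
library(tkWidgets)
pExplorer("base")  # browse the files shipped with the base package
eExplorer("base")  # view and run example code from base help files
vExplorer()        # list all installed packages that have vignettes
```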

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation", xscale = 365,
       ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments. (Axes: "Years since randomisation" vs. "% surviving"; legend: Standard, New.)

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt + ratetable(edtrt = edtrt,
                bili = bili, platelet = platelet,
                age = age, protime = protime),
                data = pbc, subset = trt == -9,
                ratetable = mayomodel, cohort = TRUE),
        col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards. (Axes: Time vs. Beta(t) for edtrt.)

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
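As a sketch of the syntax, using the bladder cancer recurrence data shipped with the package (variable names are those of that dataset):

```r
library(survival)
data(bladder)
## robust variance for repeated events within the same subject
fit1 <- coxph(Surv(stop, event) ~ rx + size + number +
              cluster(id), data = bladder)
## separate baseline hazard for each count of prior recurrences
fit2 <- coxph(Surv(stop, event) ~ rx + size +
              strata(number), data = bladder)
```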

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004

The R User Conference

by John Fox
McMaster University, Canada
[email protected]

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference — useR! 2004 — which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations — from bioinformatics to user interfaces, and finance to fights among lizards — reflects the wide range of applications supported by R.

A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference — the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics

Slides for the keynote addresses, along with abstracts of the other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk

Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt, and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
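The internal representations described above can be compared directly with unclass. The following sketch (which assumes the chron package is installed) shows noon on January 2nd, 1970 in chron and POSIXct, and the corresponding date in the Date class:

```r
library(chron)
d  <- as.Date("1970-01-02")
dc <- chron("01/02/70", "12:00:00")
dp <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")

unclass(d)   # 1      (days since 1970-01-01)
unclass(dc)  # 1.5    (days, with the time as a fraction of a day)
unclass(dp)  # 129600 (seconds: 1.5 * 24*60*60)
```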

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word 'usually' was employed above.
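For instance, converting a small vector of Windows-Excel serial day numbers (the values here are arbitrary):

```r
x <- c(36526.0, 36526.5)          # Excel (Windows) serial day numbers
as.Date("1899-12-30") + floor(x)  # both map to "2000-01-01"
```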

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or timedate package to represent datetimes.

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

## default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
## if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  ## days since 01/01/60
## chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames — POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison Table

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date     Sys.Date()
  chron    as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct  Sys.time()

origin
  Date     structure(0, class = "Date")
  chron    chron(0)
  POSIXct  structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date     structure(floor(x), class = "Date")
  chron    chron(x)
  POSIXct  structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT

diff in days
  Date     dt1 - dt2
  chron    dc1 - dc2
  POSIXct  difftime(dp1, dp2, unit = "day")  [1]

time zone difference
  POSIXct  dp - as.POSIXct(format(dp), tz = "GMT")

compare
  Date     dt1 > dt2
  chron    dc1 > dc2
  POSIXct  dp1 > dp2

next day
  Date     dt + 1
  chron    dc + 1
  POSIXct  seq(dp0, length = 2, by = "DSTday")[2]  [2]

previous day
  Date     dt - 1
  chron    dc - 1
  POSIXct  seq(dp0, length = 2, by = "-1 DSTday")[2]  [3]

x days since date
  Date     dt0 + floor(x)
  chron    dc0 + x
  POSIXct  seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  [4]

sequence
  Date     dts <- seq(dt0, length = 10, by = "day")
  chron    dcs <- seq(dc0, length = 10)
  POSIXct  dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date     plot(dts, rnorm(10))
  chron    plot(dcs, rnorm(10))
  POSIXct  plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  [5]

every 2nd week
  Date     seq(dt0, length = 3, by = "2 week")
  chron    dc0 + seq(0, length = 3, by = 14)
  POSIXct  seq(dp0, length = 3, by = "2 week")

first day of month
  Date     as.Date(format(dt, "%Y-%m-01"))
  chron    chron(dc) - month.day.year(dc)$day + 1
  POSIXct  as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date     as.numeric(format(dt, "%m"))
  chron    months(dc)
  POSIXct  as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date     as.numeric(format(dt, "%w"))
  chron    as.numeric(dates(dc) - 3) %% 7
  POSIXct  as.numeric(format(dp, "%w"))

day of year
  Date     as.numeric(format(dt, "%j"))
  chron    as.numeric(format(as.Date(dc), "%j"))
  POSIXct  as.numeric(format(dp, "%j"))

mean each month/year
  Date     tapply(x, format(dt, "%Y-%m"), mean)
  chron    tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct  tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date     tapply(x, format(dt, "%m"), mean)
  chron    tapply(x, months(dc), mean)
  POSIXct  tapply(x, format(dp, "%m"), mean)

Output, yyyy-mm-dd
  Date     format(dt)
  chron    format(as.Date(dates(dc)))
  POSIXct  format(dp, "%Y-%m-%d")

Output, mm/dd/yy
  Date     format(dt, "%m/%d/%y")
  chron    format(dates(dc))
  POSIXct  format(dp, "%m/%d/%y")

Output, Sat Jul 1, 1970
  Date     format(dt, "%a %b %d, %Y")
  chron    format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct  format(dp, "%a %b %d, %Y")

Input, "1970-10-15"
  Date     as.Date(z)
  chron    chron(z, format = "y-m-d")
  POSIXct  as.POSIXct(z)

Input, "10/15/1970"
  Date     as.Date(z, "%m/%d/%Y")
  chron    chron(z)
  POSIXct  as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date     as.Date(dates(dc)); as.Date(format(dp))
  to chron    chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
  to POSIXct  as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date     as.Date(dates(dc)); as.Date(dp)
  to chron    chron(unclass(dt)); as.chron(dp)
  to POSIXct  as.POSIXct(format(dt), tz = "GMT"); as.POSIXct(dc)

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).


Programmers' Niche

A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 − specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to drawROC curves improve it and then use it as the basisof S3 and S4 classes

Simply coding up the definition of the ROC curvegives a function that is efficient enough for mostpractical pruposes The one necessary optimisationis to realise that plusmninfin and the the unique values ofT are the only cutpoints needed Note the use ofsapply to avoid introducing an index variable for el-ements of cutpoints

  drawROC <- function(T, D) {
    cutpoints <- c(-Inf, sort(unique(T)), Inf)
    sens <- sapply(cutpoints,
                   function(c) sum(D[T > c])/sum(D))
    spec <- sapply(cutpoints,
                   function(c) sum((1-D)[T <= c])/sum(1-D))
    plot(1-spec, sens, type = "l")
  }

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

  drawROC <- function(T, D) {
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    plot(mspec, sens, type = "l")
  }

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

  ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
                 test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
  }

  print.ROC <- function(x, ...) {
    cat("ROC curve: ")
    print(x$call)
  }

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

  plot.ROC <- function(x, type = "b", null.line = TRUE,
                       xlab = "1-Specificity",
                       ylab = "Sensitivity",
                       main = NULL, ...) {
    par(pty = "s")
    plot(x$mspec, x$sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x$call
    title(main = main)
  }

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

  lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
  }

A method for identify allows the cutpoint to be found for any point on the curve:

  identify.ROC <- function(x, labels = NULL, ...,
                           digits = 1) {
    if (is.null(labels))
      labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels = labels, ...)
  }

An AUC function to compute the area under the ROC curve is left as an exercise for the reader.
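One possible solution to the exercise (my own sketch, not code from the article) applies the trapezoid rule to the (1 - specificity, sensitivity) points stored in the S3 object, assuming the sens and mspec components described above:

```r
## Not from the article: a sketch of an AUC generic and a method for
## the S3 "ROC" class, using trapezoidal integration.
AUC <- function(x, ...) UseMethod("AUC")

AUC.ROC <- function(x, ...) {
  ## order the points by increasing false positive rate
  ord <- order(x$mspec, x$sens)
  fpr <- x$mspec[ord]
  tpr <- x$sens[ord]
  ## trapezoid rule: interval widths times average heights
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1))/2)
}
```

A perfectly discriminating test traces the left and top edges of the unit square, giving AUC(x) equal to 1.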

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

  ROC <- function(T, ...) UseMethod("ROC")

  ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
                 test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
  }

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:

  ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
                 test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
  }

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

  setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
                   test = "numeric", call = "call"),
    validity = function(object) {
      length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
    })

The first argument to setClass gives the name of the new class. The second describes the components ("slots") that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

  ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  }

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

  setMethod("show", "ROC",
    function(object) {
      cat("ROC curve: ")
      print(object@call)
    })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

  setMethod("plot", c("ROC", "missing"),
    function(x, y, type = "b", null.line = TRUE,
             xlab = "1-Specificity",
             ylab = "Sensitivity",
             main = NULL, ...) {
      par(pty = "s")
      plot(x@mspec, x@sens, type = type,
           xlab = xlab, ylab = ylab, ...)
      if (null.line)
        abline(0, 1, lty = 3)
      if (is.null(main))
        main <- x@call
      title(main = main)
    })

Creating a method for lines appears to work the same way:

  setMethod("lines", "ROC",
    function(x, ...)
      lines(x@mspec, x@sens, ...)
    )

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

  > lines
  function (x, ...)
  UseMethod("lines")
  <environment: namespace:graphics>

and after the setMethod call

  > lines
  standardGeneric for "lines" defined from
  package "graphics"

  function (x, ...)
  standardGeneric("lines")
  <environment: 0x2fc68a8>
  Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

  setGeneric("ROC")

  setMethod("ROC", c("glm", "missing"),
    function(T, ...) {
      if (!(T$family$family %in%
            c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
      test <- fitted(T)
      disease <- (test + resid(T, type = "response"))
      disease <- disease * weights(T)
      if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
      TT <- rev(sort(unique(test)))
      DD <- table(-test, disease)
      sens <- cumsum(DD[, 2])/sum(DD[, 2])
      mspec <- cumsum(DD[, 1])/sum(DD[, 1])
      new("ROC", sens = sens, mspec = mspec,
          test = TT, call = sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

  setGeneric("identify", signature = c("x"))

  setMethod("identify", "ROC",
    function(x, labels = NULL, ...,
             digits = 1) {
      if (is.null(labels))
        labels <- round(x@test, digits)
      identify(x@mspec, x@sens,
               labels = labels, ...)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[ and [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
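A minimal sketch of these accessors (my illustration, not code from the NEWS file):

```r
## $ and [[ on environments behave like get()/assign() with
## inherits = FALSE: character indices only, no partial matching.
e <- new.env()
e$x <- 1          # like assign("x", 1, envir = e)
e[["y"]] <- 2     # character-indexed assignment
e$x + e[["y"]]    # like get("x", envir = e, inherits = FALSE) + ...
```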

• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
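A quick illustration of the class (my example, not from the NEWS file):

```r
## "Date" stores dates without times; arithmetic works in whole days.
d1 <- as.Date("2004-06-01")
d2 <- as.Date("2004-01-01")
d1 - d2        # a difference of 152 days (2004 is a leap year)
weekdays(d1)   # utility functions parallel the date-time ones
```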

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
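Because of these definitions, non-integer arguments work too; a small illustration (mine, not from the NEWS file):

```r
## factorial(x) is gamma(x+1); lfactorial(x) is lgamma(x+1), which
## keeps very large factorials on the log scale.
factorial(5)                       # 120
factorial(0.5)                     # equals gamma(1.5)
lfactorial(100)                    # log(100!), computed without forming 100!
```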

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call; the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame, in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
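For example (my illustration, not from the NEWS file):

```r
## head() and tail() return the first or last parts of an object.
head(letters, 3)                  # "a" "b" "c"
tail(letters, 2)                  # "y" "z"
head(data.frame(x = 1:10), 4)     # also works for data frames
```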

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
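A short sketch (my example, not from the NEWS file):

```r
## mget() looks up several names at once and returns a named list.
e <- new.env()
e$a <- 1; e$b <- 2
vals <- mget(c("a", "b"), envir = e)
vals$a + vals$b    # 3
```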

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
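For instance (my illustration, not from the NEWS file):

```r
## shQuote() makes a string safe to embed in a shell command line;
## the default is Bourne-shell single quoting.
f <- "my file.txt"
shQuote(f)                        # 'my file.txt'
cmd <- paste("ls -l", shQuote(f)) # safe even though f contains a space
```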

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A",
            gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"),
            gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list, and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references; they are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping

Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with the default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument, which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.

kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.

rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A Simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-PLUS times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631


and act in a responsible and professional manner, and meet legal and regulatory obligations. There would be no value in the software if my clients and I could not trust the results I obtained.

In the case of professional tools, such as software, being used internally, one typically has many alternatives to consider. You must first be sure that the potential choices are validated against a list of required functional and quality benchmarks driven by the needs of your prospective clients. You must, of course, be sure that the choices fit within your operating budget. In the specific case of analytic software applications, the list to consider and the price points were quite varied. At the high end of the price range are applications like SAS and SPSS, which have become prohibitively expensive for small businesses that require commercial licenses. Despite my past lengthy experience with SAS, it would not have been financially practical to consider using it. A single-user desktop license for the three key components (Base, Stat and Graph) would have cost about (US)$5,000 per year, well beyond my budget.

Over a period of several months through late 2000, I evaluated demonstration versions of several statistical applications. At the time I was using Windows XP, and thus considered Windows versions of any applications. I narrowed the list to S-PLUS and Stata. Because of my prior experience with both SAS and S-PLUS, I was comfortable with command line applications, as opposed to those with menu-driven "point and click" GUIs. Both S-PLUS and Stata would meet my functional requirements regarding analytic methods and graphics. Because both provided for programming within the language, I could reasonably expect to introduce some productivity enhancements and reduce my costs by more efficient use of my time.

An Introduction To R

About then (late 2000), I became aware of R through contacts with other users and through posts to various online forums. I had not yet decided on S-PLUS or Stata, so I downloaded and installed R to gain some experience with it. This was circa R version 1.2.x. I began to experiment with R, reading the available online documentation, perusing posts to the r-help list, and using some of the books on S-PLUS that I had previously purchased, most notably the first edition (1994) of Venables and Ripley's MASS.

Although I was not yet specifically aware of the R Community and the general culture of the GPL and Open-Source application development, the basic premise and philosophy was not foreign to me. I was quite comfortable with online support mechanisms, having used Usenet, online "knowledge bases" and other online community forums in the past. I had found that, more often than not, these provided more competent and expedient support than did the telephone support from a vendor. I also began to realize that the key factor in the success of my business would be superior client service, and not expensive proprietary technology. My experience and my needs helped me accept the basic notion of Open-Source development and online community-based support.

From my earliest exposure to R, it was evident to me that this application had evolved substantially in a relatively short period of time (as it has continued to do) under the leadership of some of the best minds in the business. It also became rapidly evident that it would fit my functional requirements, and my comfort level with it increased substantially over a period of months. Of course, my prior experience with S-PLUS certainly helped me to learn R rapidly (although there are important differences between S-PLUS and R, as described in the R FAQ).

I therefore postponed any purchase decisions regarding S-PLUS or Stata, and made a commitment to using R.

I began to develop and implement various functions that I would need in providing certain analytic and reporting capabilities for clients. Over a period of several weeks, I created a package of functions for internal use that substantially reduced the time it takes to import and structure data from certain clients (typically provided as Excel workbooks or as ASCII files) and then generate a large number of tables and graphs. These types of reports typically use standardized datasets and are created reasonably frequently. Moreover, I could easily enhance and introduce new components to the reports as required by the changing needs of these clients. Over time, I introduced further efficiencies and was able to take a process that initially had taken several days quite literally down to a few hours. As a result, I was able to spend less time on basic data manipulation and more time on the review, interpretation and description of key findings.

In this way, I could reduce the total time required for a standardized project, reducing my costs and also increasing the value to the client by being able to spend more time on the higher-value services to the client, instead of on internal processes. By enhancing the quality of the product and ultimately passing my cost reductions on to the client, I am able to increase substantially the value of my services.

A Continued Evolution

In my more than three years of using R, I have continued to implement other internal process changes. Most notably, I have moved day-to-day computing operations from Windows XP to Linux. Needless to say, R's support of multiple operating systems made this transition essentially transparent for the application software. I also started using ESS as my primary working environment for R, and enjoyed its higher level of support for R compared to the editor that I had used under Windows, which simply provided syntax highlighting.

My transformation to Linux occurred gradually during the later Red Hat (RH) 8.0 beta releases, as I became more comfortable with Linux and with the process of replacing Windows functionality that had become "second nature" over my years of MS operating system use (dating back to the original IBM PC/DOS days of 1981). For example, I needed to replace my practice in R of generating WMF files to be incorporated into Word and Powerpoint documents. At first, I switched to generating EPS files for incorporation into OpenOffice.org's (OO.org) Writer and Impress documents, but now I primarily use LaTeX and the "seminar" slide creation package, frequently in conjunction with Sweave. Rather than using Powerpoint or Impress for presentations, I create landscape oriented PDF files and display them with Adobe Acrobat Reader, which supports a full screen presentation mode.

I also use OO.org's Writer for most documents that I need to exchange with clients. For the final version, I export a PDF file from Writer, knowing that clients will have access to Adobe's Reader. Because OO.org's Writer supports MS Word input formats, it provides for interchange with clients when jointly editable documents (or tracking of changes) are required. When clients send me Excel worksheets, I use OO.org's Calc, which supports Excel data formats (and in fact exports much better CSV files from these worksheets than does Excel). I have replaced Outlook with Ximian's Evolution for my e-mail, contact list and calendar management, including the coordination of electronic meeting requests with business partners and clients and, of course, synching my Sony Clie (which Sony recently announced will soon be deprecated in favor of smarter cell phones).

With respect to Linux, I am now using Fedora Core (FC) 2, having moved from RH 8.0 to RH 9 to FC 1. I have in time replaced all Windows-based applications with Linux-based alternatives, with one exception, and that is some accounting software that I need in order to exchange electronic files with my accountant. In time, I suspect that this final hurdle will be removed and, with it, the need to use Windows at all. I should make it clear that this is not a goal for philosophic reasons, but for business reasons. I have found that the stability and functionality of the Linux environment has enabled me to engage in a variety of technical and business process related activities that would either be problematic or prohibitively expensive under Windows. This enables me to realize additional productivity gains that further reduce operating costs.

Giving Back

In closing, I would like to make a few comments about the R Community.

I have felt fortunate to be a useR, and I also have felt strongly that I should give back to the Community - recognizing that, ultimately, through MedAnalytics and the use of R, I am making a living using software that has been made available through voluntary efforts and which I did not need to purchase.

To that end, over the past three years, I have committed to contributing and giving back to the Community in several ways. First and foremost, by responding to queries on the r-help list and, later, on the r-devel list. There has been, and will continue to be, a continuous stream of new useRs who need basic assistance. As current useRs progress in experience, each of us can contribute to the Community by assisting new useRs. In this way, a cycle develops in which the students evolve into teachers.

Secondly, in keeping with the philosophy that "useRs become developers", as I have found the need for particular features and have developed specific solutions, I have made them available via CRAN or the lists. Two functions in particular have been available in the "gregmisc" package since mid-2002, thanks to Greg Warnes. These are the barplot2() and CrossTable() functions. As time permits, I plan to enhance, most notably, the CrossTable() function with a variety of non-parametric correlation measures, and will make those available when ready.

Thirdly, in a different area, I have committed to financially contributing back to the Community by supporting the R Foundation and the recent useR! 2004 meeting. This is distinct from volunteering my time, and recognizes that ultimately even a voluntary Community such as this one has some operational costs to cover. When, in 2000-2001, I was contemplating the purchase of S-PLUS and Stata, these would have cost approximately (US)$2,500 and (US)$1,200, respectively. In effect, I have committed those dollars to supporting the R Foundation at the Benefactor level over the coming years.

I would challenge and urge other commercial useRs to contribute to the R Foundation at whatever level makes sense for your organization.

Thanks

It has been my pleasure and privilege to be a member of this Community over the past three years. It has been nothing short of phenomenal to experience the growth of this Community during that time.

I would like to thank Doug Bates for his kind invitation to contribute this article, the idea for which arose in a discussion we had during the recent useR! meeting in Vienna. I would also like to extend my thanks to R Core and to the R Community at large for their continued drive to create and support this wonderful vehicle for international collaboration.

Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
MSchwartz@MedAnalytics.com

The ade4 package - I : One-table methods

by Daniel Chessel, Anne B. Dufour and Jean Thioulouse

Introduction

This paper is a short summary of the main classes defined in the ade4 package for one-table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-table coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-way tables), and for graphical methods.

This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number, but means that there are four E in the acronym).

The ade4 package is available on CRAN, but it can also be used directly online, thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).

The duality diagram class

The basic tool in ade4 is the duality diagram (Escoufier (1987)). A duality diagram is simply a list that contains a triplet (X, Q, D):

- X is a table with n rows and p columns, considered as p points in R^n (column vectors) or n points in R^p (row vectors).

- Q is a p x p diagonal matrix containing the weights of the p columns of X, used as a scalar product in R^p (Q is stored as a vector of length p).

- D is an n x n diagonal matrix containing the weights of the n rows of X, used as a scalar product in R^n (D is stored as a vector of length n).

For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix I_p, and if D is equal to (1/n) I_n, the triplet corresponds to a principal component analysis on the correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see Table 1), but more complex methods can also be represented by their duality diagram.

Functions   Analyses                                        Notes
dudi.pca    principal component                             1
dudi.coa    correspondence                                  2
dudi.acm    multiple correspondence                         3
dudi.fca    fuzzy correspondence                            4
dudi.mix    analysis of a mixture of numeric and factors    5
dudi.nsc    non-symmetric correspondence                    6
dudi.dec    decentered correspondence                       7

Table 1: The dudi functions. 1: Principal component analysis, same as prcomp/princomp. 2: Correspondence analysis (Greenacre (1984)). 3: Multiple correspondence analysis (Tenenhaus and Young (1985)). 4: Fuzzy correspondence analysis (Chevenet et al. (1994)). 5: Analysis of a mixture of numeric variables and factors (Hill and Smith (1976), Kiers (1994)). 6: Non-symmetric correspondence analysis (Kroonenberg and Lombardo (1999)). 7: Decentered correspondence analysis (Dolédec et al. (1995)).
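As a quick sanity check of the normed-PCA triplet described above, its eigenvalues can be reproduced in base R, without ade4, from the correlation matrix of the data (this check is our addition, not part of the original article):

```r
# Eigenvalues of the normed-PCA triplet (X, Ip, (1/n)In) are those of the
# correlation matrix of the variables. USArrests ships with base R.
ev <- eigen(cor(USArrests), symmetric = TRUE)$values
round(ev, 4)   # 2.4802 0.9898 0.3566 0.1734
```

These values match the eigenvalues printed by dudi.pca in the example below.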

The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.

We can use, for example, a well-known dataset from the base package:

> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

scannf = FALSE means that the number of principal components used to compute row and column coordinates should not be asked interactively of the user, but taken as the value of the argument nf (by default, nf = 2). Other parameters allow the user to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:


> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector  length  mode     content
1 $cw     4       numeric  column weights
2 $lw     50      numeric  row weights
3 $eig    4       numeric  eigen values

  data.frame  nrow  ncol  content
1 $tab        50    4     modified array
2 $li         50    3     row coordinates
3 $l1         50    3     row normed scores
4 $co         4     3     column coordinates
5 $c1         4     3     column normed scores
other elements: cent norm

pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit-variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
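The property that the variance of the row coordinates equals the eigenvalue can be sketched in base R (our illustration, assuming ade4's normed PCA uses divisor n rather than n - 1 for the variance):

```r
# Base-R emulation of normed-PCA row coordinates (assumption: divisor n).
n <- nrow(USArrests)
X <- scale(USArrests) * sqrt(n / (n - 1))   # unit variance with divisor n
e <- eigen(cor(USArrests), symmetric = TRUE)
li <- X %*% e$vectors                        # row coordinates on all axes
colSums(li^2) / n                            # divisor-n variances = eigenvalues
```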

The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:

> score(pca1)
> s.corcircle(pca1$co)

Figure 1: One-dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.

Figure 2: Two-dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.

The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):

> scatter(pca1)


Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed on the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.

Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:

> s.label(pca1$li)

Figure 4: Individuals factor map in a normed PCA.

Distance matrices

A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre (1986)), and make them Euclidean (quasieuclid, lingoes (Lingoes (1971)), cailliez (Cailliez (1983))). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).

The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows us to compare the geographical, genetic and anthropometric distances.

> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")
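For comparison, classical PCO (Gower (1966)) is also available in base R as cmdscale; a minimal sketch on a generic Euclidean distance matrix (our illustration, not the Yanomama data, which ships with ade4):

```r
# cmdscale() performs classical principal coordinates analysis.
d <- dist(scale(USArrests))        # a Euclidean distance matrix, base R data
pco <- cmdscale(d, k = 3, eig = TRUE)
dim(pco$points)                    # 50 rows x 3 retained axes
```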

Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.


Taking into account groups of individuals

In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):

> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)

The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). This is therefore a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
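The construction of the table G of group means at the heart of a between-class analysis can be sketched in base R (our illustration, using a base dataset and an invented factor, since only the mechanics matter here):

```r
# G: one row per class, one column per variable, holding class means.
f <- factor(rep(c("S1", "S2"), each = 25))   # invented grouping of the 50 rows
G <- apply(USArrests, 2, function(v) tapply(v, f, mean))
dim(G)   # 2 groups x 4 variables
```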

Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).

Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (X^t D X)^-, Dw), where (X^t D X)^- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^(-1) Y^t D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).

Permutation tests

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest) programming language. The R version allows the user to see how computations are performed, and to write easily other tests, while the C version is needed for performance reasons.

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")

Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
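The mechanics of such a randomization test can be sketched in base R (our simplified illustration, not ade4's implementation): compute a between-group statistic, recompute it on permuted data, and compare:

```r
# Between-group sum of squares of one variable against its permutation
# distribution, with an invented factor f over a base dataset.
set.seed(1)
x <- USArrests$Murder
f <- factor(rep(1:5, each = 10))
bss <- function(x, f)
  sum(tabulate(f) * (tapply(x, f, mean) - mean(x))^2)
obs <- bss(x, f)
sim <- replicate(999, bss(sample(x), f))
pval <- (sum(sim >= obs) + 1) / (999 + 1)   # simulated p-value
```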

Conclusion

We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).

We are preparing a second paper, dealing with two-table coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).

The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al. (1994)) (statis and pta functions) or from the multiple co-inertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305-310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39-54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295-309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du haut-rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29-40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078-3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139-156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325-338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5-48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249-255.

Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197-212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367-396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The act (statis method). Computational Statistics and Data Analysis, 18:97-119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1-24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195-203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393-3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519-524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91-119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). Ade-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75-83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272-283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: Tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13-26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404-423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): Multivariate analysis of size and shape. Journal of Zoology, 199:421-432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows users to:

- plot Shewhart quality control charts for continuous, attribute and count data;
- plot Cusum and EWMA charts for continuous data;
- draw operating characteristic curves;
- perform process capability analyses;
- draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of this last environment will find some similarities, but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rationale group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default, a Shewhart chart is drawn (see Figure 1), and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


type      Control charts for variables
xbar      X chart. Sample means are plotted to control the mean level of a continuous process variable.
xbar.one  X chart. Sample values from a one-at-time data process are plotted to control the mean level of a continuous process variable.
R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type      Control charts for attributes
p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
c         c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control", and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at +/-3 sigma from the center. This default can be changed using the argument nsigmas (so called "american practice"), or specifying the confidence level (so called "british practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at +/-3 sigma correspond to a two-tails probability of 0.0027 under a standard normal curve.
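This equivalence is a one-liner to verify (our check, not part of the original article):

```r
# Two-tailed probability outside +/-3 sigma under a standard normal curve.
2 * pnorm(-3)   # 0.002699796, i.e. the 0.0027 quoted above
```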

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.

Figure 1: A X chart for samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly, as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

Figure 2: A X chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-time data may be obtained specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
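The moving-range estimate can be sketched in a few lines (our illustration, assuming the usual unbiasing constant d2 = 1.128 for ranges of 2 consecutive points; the article does not spell out qcc's exact estimator):

```r
# Moving-range estimate of the process standard deviation for k = 2.
x <- c(74.030, 74.002, 74.019, 73.992, 74.008)   # a few sample values
mr <- abs(diff(x))                               # moving ranges of 2 points
sigma.hat <- mean(mr) / 1.128                    # divide by d2 for n = 2
```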

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
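For an X chart with 3-sigma limits, the type II error has a closed form that an OC curve traces out (the standard textbook formula, e.g. Montgomery (2000); this sketch is our addition):

```r
# Probability of a point falling inside 3-sigma limits after the mean
# shifts by delta process standard deviations, with sample size n.
beta <- function(delta, n)
  pnorm(3 - delta * sqrt(n)) - pnorm(-3 - delta * sqrt(n))
beta(1, 5)   # chance of missing a 1-sigma shift on any one sample, n = 5
```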

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.

Figure 3: Operating characteristics curves.

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
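The upper cumulative sum itself follows a simple recursion; a minimal sketch of the one-sided tabular cusum (our illustration, assuming standardized group statistics z and the conventional reference value k = se.shift / 2):

```r
# C_t = max(0, z_t - k + C_{t-1}), starting from C_0 = 0.
cusum.pos <- function(z, k = 0.5)
  Reduce(function(C, zi) max(0, zi - k + C), z, 0, accumulate = TRUE)[-1]
cusum.pos(c(1, 1, 1))   # 0.5 1.0 1.5: a sustained shift accumulates
```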

Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variation on the mean of the process.

The EWMA chart depends on the smoothing parameter lambda, which controls the weighting scheme applied. By default, it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.


Figure 5: EWMA chart.
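The smoothing recursion behind the chart is w_t = lambda * x_t + (1 - lambda) * w_{t-1}; a minimal sketch (our addition, starting the recursion at a chosen value rather than reproducing qcc's internals):

```r
# EWMA of a series x with smoothing parameter lambda.
ewma.stat <- function(x, lambda = 0.2, start = mean(x))
  Reduce(function(w, xi) lambda * xi + (1 - lambda) * w, x, start,
         accumulate = TRUE)[-1]
ewma.stat(c(1, 1, 1), start = 1)   # a constant series stays at its level
```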

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc', for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+   spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
        Center = 74.00118       LSL = 73.95
        StdDev = 0.009887547    USL = 74.05

Capability indices:
       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).

Figure 6: Process capability analysis for diameter[1:25,]. Number of obs = 125; Center = 74.00118; StdDev = 0.009887547; Target = 74; LSL = 73.95; USL = 74.05; Cp = 1.69; Cp_l = 1.73; Cp_u = 1.65; Cp_k = 1.65; Cpm = 1.67.

A further display, called a "process capability six-pack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q–Q plot

• a capability plot

As an example, we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,


and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).

Figure 7: Pareto chart for defect (frequency and cumulative percentage).

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

Figure 8: Cause-and-effect diagram for the effect "Surface Flaws".

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
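A short sketch of the kind of call described above; run.length is the run-length option mentioned in the text, while the value 7 is purely illustrative:

```r
## list all available options and their current values
qcc.options()
## e.g. signal a point as out of control only after a run of 7
qcc.options(run.length = 7)
```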

Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be


called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.")
    }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel). Number of groups = 15; Number beyond limits = 0; Number violating runs = 0 in both cases.

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R
Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra -

how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

β̂ = argmin_β ‖y − Xβ‖²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed, (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way, using code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
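As a minimal illustration of rules 1 and 2, the two forms can be compared on simulated data (the X and y below are not from the article):

```r
set.seed(1)
X <- matrix(rnorm(1000 * 50), 1000, 50)   # simulated model matrix
y <- rnorm(1000)

## literal translation of (2): explicit inverse, two transposes
naive  <- solve(t(X) %*% X) %*% t(X) %*% y
## preferred: one crossprod each, one linear solve, no inverse
better <- solve(crossprod(X), crossprod(X, y))

all.equal(as.vector(naive), as.vector(better))
```

Both give the same coefficients; the second form is faster and numerically more stable.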

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:

> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:

> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE

R News ISSN 1609-3631

Vol 41 July 8 2004 19

Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the poMatrix class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix


product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
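A quick way to observe this effect on one's own machine (matrix sizes are illustrative, and the timing difference depends on the BLAS in use):

```r
A <- matrix(rnorm(2000 * 200), 2000, 200)
B <- matrix(rnorm(2000 * 200), 2000, 200)
system.time(t(A) %*% B)        # forms t(A) explicitly, then multiplies
system.time(crossprod(A, B))   # computes A'B without forming t(A)
```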

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcl/tk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcl/tk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior, if nothing is provided by a user, is to open the first package in the library of locally installed R packages (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When

vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution window.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone, it may be omitted.
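A minimal sketch of this counting-process form, using toy numbers rather than the article's data:

```r
library(survival)
start <- c(0, 0, 2)   # when observation begins
stop  <- c(5, 7, 9)   # when observation ends
event <- c(1, 0, 1)   # 1 = event seen, 0 = censored
Surv(start, stop, event)   # a '+' in the printed output marks censoring
```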

In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)
## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
        diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the G^ρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time,status)~trt, data=veteran),
       xlab="Years since randomisation",
       xscale=365, ylab="% surviving", yscale=100,
       col=c("forestgreen","blue"))
> survdiff(Surv(time,status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (Standard and New); percent surviving versus years since randomisation.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h₀(t) + β′z.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h₀(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time,status)~edtrt+
        log(bili)+log(protime)+age+platelet,
        data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) +
        age + platelet, data = pbc,
        subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185 on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time,status)~edtrt,
        data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
        ratetable(edtrt=edtrt, bili=bili,
            platelet=platelet, age=age,
            protime=protime),
        data=pbc, subset=trt==-9,
        ratetable=mayomodel, cohort=TRUE),
      col="purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The


cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards: smoothed estimate of β(t) for edtrt over time.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve

shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
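A hedged sketch of the formula syntax just described; mydata and its columns (treatment, site, id) are hypothetical, so this is not runnable as-is:

```r
library(survival)
## multiple records per subject: cluster(id) gives a robust variance,
## strata(site) gives each site its own baseline hazard
fit <- coxph(Surv(start, stop, event) ~ treatment +
               strata(site) + cluster(id), data = mydata)
```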

Complementing coxph is survreg, which fits linear models for the mean of survival time, or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada
[email protected]

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference, the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
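The internal representations described above are easy to verify in base R with unclass(). A small sketch (only Date and POSIXct are shown, since chron is a contributed package):

```r
# Date: days since 1970-01-01
d <- as.Date("1970-01-02")
unclass(d)        # 1

# POSIXct: seconds since 1970-01-01 00:00:00 GMT
p <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")
unclass(p)        # 86400
```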

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
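The Windows-origin recipe can be checked on a small vector of Excel-style serial numbers (a sketch; the numbers here are made up for illustration):

```r
# Excel (Windows origin) serial day numbers, with a fractional time part
x <- c(1, 36526.5)                 # one day, and noon on day 36526

# days-since-1899-12-30 converted to Date class dates
as.Date("1899-12-30") + floor(x)   # "1899-12-31" "2000-01-01"
```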

SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and date/times are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent date/times; zoo can use any date or time/date package to represent date/times.

Input

By default, read.table will read in numeric data, such as 20040115, as numbers, and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE, then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes the user should be awareof the following

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
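The convert-through-character advice can be sketched in a few lines of base R; the time stamp here is arbitrary:

```r
# A POSIXct time stamp, fixed in GMT
p <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")
as.numeric(p)                    # 1086091200 seconds since 1970-01-01 GMT

# round trip through character, naming the time zone both ways,
# preserves the clock time exactly
s <- format(p, tz = "GMT")
p2 <- as.POSIXct(s, tz = "GMT")
as.numeric(p2) == as.numeric(p)  # TRUE
```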

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT days

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output:

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct: format(dp, "%a %b %d, %Y")

Input:

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = ""):

to Date
  from chron:   as.Date(dates(dc))
  from POSIXct: as.Date(format(dp))

to chron
  from Date:    chron(unclass(dt))
  from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]

to POSIXct
  from Date:    as.POSIXct(format(dt))
  from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT"):

to Date
  from chron:   as.Date(dates(dc))
  from POSIXct: as.Date(dp)

to chron
  from Date:    chron(unclass(dt))
  from POSIXct: as.chron(dp)

to POSIXct
  from Date:    as.POSIXct(format(dt), tz = "GMT")
  from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/times suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 − specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
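A tiny worked example (the data are made up for illustration) shows the table construction and its monotonicity:

```r
T <- c(0.1, 0.4, 0.35, 0.8)   # test results
D <- c(0, 0, 1, 1)            # disease indicator
DD <- table(-T, D)            # rows ordered from highest to lowest T
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
all(diff(sens) >= 0) && all(diff(mspec) >= 0)  # TRUE: both rates are monotone
```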

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function to compute the area under theROC curve is left as an exercise for the reader
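One possible solution to that exercise (my sketch, not part of the article) applies the trapezoidal rule to the stored rates, padding with the (0, 0) and (1, 1) endpoints:

```r
# Trapezoidal-rule AUC for the list structure built by ROC() above
AUC <- function(x) {
  fpr <- c(0, x$mspec, 1)
  tpr <- c(0, x$sens, 1)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# a perfectly separating test gives AUC = 1
DD <- table(-c(1, 2, 3, 4), c(0, 0, 1, 1))
r <- list(sens  = cumsum(DD[, 2]) / sum(DD[, 2]),
          mspec = cumsum(DD[, 1]) / sum(DD[, 1]))
AUC(r)  # 1
```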

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
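To see the validity check in action, one can repeat the class definition (so the snippet is self-contained) and try to build an inconsistent object; new() refuses it:

```r
library(methods)

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

# mismatched slot lengths fail the validity check
bad <- try(new("ROC", sens = c(0, 1), mspec = 0,
               test = 0, call = quote(ROC())), silent = TRUE)
inherits(bad, "try-error")  # TRUE
```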

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R
by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

bull $ $lt- [[ [[lt- can be applied to environ-ments Only character arguments are allowedand no partial matching is done The semanticsare basically that of getassign to the environ-ment with inherits=FALSE

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
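
A brief sketch of the new class in use (dates chosen arbitrarily):

```r
d <- as.Date("2004-06-01")
d + 30                        # day-based arithmetic: 2004-07-01
format(d, "%d %b %Y")         # formatting, as for date-times
as.Date("2004-07-01") - d     # a difftime of 30 days
```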

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
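
For example:

```r
factorial(5)                         # 120, computed as gamma(5 + 1)
lfactorial(1000)                     # log(1000!) without overflowing
all.equal(factorial(5), gamma(6))    # TRUE, by definition
```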

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
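
A minimal sketch (names invented):

```r
e <- new.env()
e$a <- 1
e$b <- 2
mget(c("a", "b"), envir = e)    # named list: list(a = 1, b = 2)
```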

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
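
The S-compatible behaviour can be illustrated as follows:

```r
rep(integer(0), length.out = 3)       # now NA NA NA, not integer(0)
rep(1:2, each = 2, length.out = 5)    # recycles: 1 1 2 2 1
rep(1:3, times = 2, length.out = 4)   # times is ignored: 1 2 3 1
```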

• rgb2hsv() is new, an R interface to the C API function with the same name.
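
For instance:

```r
rgb2hsv(255, 0, 0)    # pure red: h = 0, s = 1, v = 1
```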

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
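
For example, on a Bourne-type shell (the file name here is made up):

```r
f <- "my file.txt"                   # a file name containing a space
cmd <- paste("ls -l", shQuote(f))    # ls -l 'my file.txt'
cmd
# system(cmd)                        # now safe to run
```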

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
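
The fixed argument avoids having to escape regular-expression metacharacters:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
# without fixed = TRUE, "." is a regexp matching any single character
strsplit("a1b2c3", "[12]")[[1]]             # regexp split: "a" "b" "c3"
```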

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment() to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
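
For example, with a two-way table built from the standard mtcars data:

```r
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
addmargins(tab)    # appends a "Sum" row and column of totals
```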

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like "vp1::vp2", but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

  "on": clip to the viewport (was TRUE)
  "inherit": clip to whatever the parent says (was FALSE)
  "off": turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implemen-tation of the random number generator withmultiple independent streams developed byLrsquoEcuyer et al (2002) The main purpose ofthis package is to enable the use of this ran-dom number generator in parallel R applica-tions By Hana Sevcikova Tony Rossini

rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers

rrcov Functions for Robust Location and Scat-ter Estimation and Robust Regression withHigh Breakdown Point (covMcd() ltsReg())Originally written for S-PLUS (fastlts andfastmcd) by Peter Rousseeuw amp Katrien vanDriessen ported to R adapted and packagedby Valentin Todorov

sandwich Model-robust standard error estimatorsfor time series and longitudinal data ByThomas Lumley Achim Zeileis

seqmon A program that computes the probabilityof crossing sequential boundaries in a clinicaltrial It implements the Armitage-McPhersonand Rowe Algorithm using the method de-scribed in Schoenfeld D (2001) ldquo A simple Al-gorithm for Designing Group Sequential Clini-cal Trialsrdquo Biometrics 27 972ndash974 By David ASchoenfeld

setRNG Set reproducible random number generatorin R and S By Paul Gilbert

sfsmisc Useful utilities [lsquogoodiesrsquo] from Seminarfuer Statistik ETH Zurich many ported fromS-plus times By Martin Maechler and manyothers

skewt Density distribution function quantile func-tion and random generation for the skewed tdistribution of Fernandez and Steel By RobertKingbdquo with contributions from Emily Ander-son

spatialCovariance Functions that compute the spa-tial covariance matrix for the matern andpower classes of spatial models for data thatarise on rectangular units This code can alsobe used for the change of support problem andfor spatial data that arise on irregularly shapedregions like counties or zipcodes by laying afine grid of rectangles and aggregating the in-tegrals in a form of Riemann integration ByDavid Clifford

spc Evaluation of control charts by means of thezero-state steady-state ARL (Average RunLength) Setting up control charts for given in-control ARL and plotting of the related figuresThe control charts under consideration are one-and two-sided EWMA and CUSUM charts formonitoring the mean of normally distributedindependent data Other charts and parame-ters are in preparation Further SPC areas willbe covered as well (sampling plans capabilityindices ) By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

R News ISSN 1609-3631

Vol. 4/1, June 2004

R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: [email protected]

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/


• Editorial
• The Decision To Use R
  • Preface
  • Some Background on the Company
  • Evaluating The Tools of the Trade - A Value Based Equation
  • An Introduction To R
  • A Continued Evolution
  • Giving Back
  • Thanks
• The ade4 package - I: One-table methods
  • Introduction
  • The duality diagram class
  • Distance matrices
  • Taking into account groups of individuals
  • Permutation tests
  • Conclusion
• qcc: An R package for quality control charting and statistical process control
  • Introduction
  • Creating a qcc object
  • Shewhart control charts
  • Operating characteristic function
  • Cusum charts
  • EWMA charts
  • Process capability analysis
  • Pareto charts and cause-and-effect diagrams
  • Tuning and extending the package
  • Summary
  • References
• Least Squares Calculations in R
• Tools for interactively exploring R packages
  • References
• The survival package
  • Overview
  • Specifying survival data
  • One and two sample summaries
  • Proportional hazards models
  • Additional features
• useR! 2004
• R Help Desk
  • Introduction
  • Other Applications
  • Time Series
  • Input
  • Avoiding Errors
  • Comparison Table
• Programmers' Niche
  • ROC curves
  • Classes and methods
  • Generalising the class
  • S4 classes and methods
  • Discussion
• Changes in R
  • New features in 1.9.1
  • User-visible changes in 1.9.0
  • New features in 1.9.0
• Changes on CRAN
  • New contributed packages
  • Other changes
• R Foundation News
  • Donations
  • New supporting institutions
  • New supporting members

this transition essentially transparent for the application software. I also started using ESS as my primary working environment for R and enjoyed its higher level of support for R compared to the editor that I had used under Windows, which simply provided syntax highlighting.

My transformation to Linux occurred gradually during the later Red Hat (RH) 8.0 beta releases as I became more comfortable with Linux and with the process of replacing Windows functionality that had become "second nature" over my years of MS operating system use (dating back to the original IBM PC/DOS days of 1981). For example, I needed to replace my practice in R of generating WMF files to be incorporated into Word and Powerpoint documents. At first I switched to generating EPS files for incorporation into OpenOffice.org's (OO.org) Writer and Impress documents, but now I primarily use LaTeX and the "seminar" slide creation package, frequently in conjunction with Sweave. Rather than using Powerpoint or Impress for presentations, I create landscape oriented PDF files and display them with Adobe Acrobat Reader, which supports a full screen presentation mode.

I also use OO.org's Writer for most documents that I need to exchange with clients. For the final version, I export a PDF file from Writer, knowing that clients will have access to Adobe's Reader. Because OO.org's Writer supports MS Word input formats, it provides for interchange with clients when jointly editable documents (or tracking of changes) are required. When clients send me Excel worksheets, I use OO.org's Calc, which supports Excel data formats (and in fact exports much better CSV files from these worksheets than does Excel). I have replaced Outlook with Ximian's Evolution for my e-mail, contact list and calendar management, including the coordination of electronic meeting requests with business partners and clients and, of course, synching my Sony Clie (which Sony recently announced will soon be deprecated in favor of smarter cell phones).

With respect to Linux, I am now using Fedora Core (FC) 2, having moved from RH 8.0 to RH 9 to FC 1. I have in time replaced all Windows-based applications with Linux-based alternatives, with one exception: some accounting software that I need in order to exchange electronic files with my accountant. In time, I suspect that this final hurdle will be removed and, with it, the need to use Windows at all. I should make it clear that this is not a goal for philosophic reasons, but for business reasons. I have found that the stability and functionality of the Linux environment has enabled me to engage in a variety of technical and business process related activities that would either be problematic or prohibitively expensive under Windows. This enables me to realize additional productivity gains that further reduce operating costs.

Giving Back

In closing, I would like to make a few comments about the R Community.

I have felt fortunate to be a useR, and I also have felt strongly that I should give back to the Community - recognizing that ultimately, through MedAnalytics and the use of R, I am making a living using software that has been made available through voluntary efforts and which I did not need to purchase.

To that end, over the past three years, I have committed to contributing and giving back to the Community in several ways. First and foremost, by responding to queries on the r-help list and, later, on the r-devel list. There has been, and will continue to be, a continuous stream of new useRs who need basic assistance. As current useRs progress in experience, each of us can contribute to the Community by assisting new useRs. In this way a cycle develops in which the students evolve into teachers.

Secondly, in keeping with the philosophy that "useRs become developers", as I have found the need for particular features and have developed specific solutions, I have made them available via CRAN or the lists. Two functions in particular have been available in the "gregmisc" package since mid-2002, thanks to Greg Warnes. These are the barplot2() and CrossTable() functions. As time permits, I plan to enhance, most notably, the CrossTable() function with a variety of non-parametric correlation measures, and will make those available when ready.

Thirdly, in a different area, I have committed to financially contributing back to the Community by supporting the R Foundation and the recent useR! 2004 meeting. This is distinct from volunteering my time and recognizes that, ultimately, even a voluntary Community such as this one has some operational costs to cover. When, in 2000-2001, I was contemplating the purchase of S-PLUS and Stata, these would have cost approximately (US)$2,500 and (US)$1,200 respectively. In effect, I have committed those dollars to supporting the R Foundation at the Benefactor level over the coming years.

I would challenge and urge other commercial useRs to contribute to the R Foundation at whatever level makes sense for your organization.

Thanks

It has been my pleasure and privilege to be a member of this Community over the past three years. It has been nothing short of phenomenal to experience the growth of this Community during that time.

I would like to thank Doug Bates for his kind invitation to contribute this article, the idea for which arose in a discussion we had during the recent useR! meeting in Vienna. I would also like to extend my thanks to R Core and to the R Community at large


for their continued drive to create and support this wonderful vehicle for international collaboration.

Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
[email protected]

The ade4 package - I: One-table methods
by Daniel Chessel, Anne B. Dufour and Jean Thioulouse

Introduction

This paper is a short summary of the main classes defined in the ade4 package for one table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-tables coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-ways tables), and for graphical methods.

This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number but means that there are four E in the acronym).

The ade4 package is available in CRAN, but it can also be used directly online, thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).

The duality diagram class

The basic tool in ade4 is the duality diagram, Escoufier (1987). A duality diagram is simply a list that contains a triplet (X, Q, D):

- X is a table with n rows and p columns, considered as p points in R^n (column vectors) or n points in R^p (row vectors).

- Q is a p x p diagonal matrix containing the weights of the p columns of X, and used as a scalar product in R^p (Q is stored under the form of a vector of length p).

- D is a n x n diagonal matrix containing the weights of the n rows of X, and used as a scalar product in R^n (D is stored under the form of a vector of length n).

For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix Ip, and if D is equal to (1/n)In, the triplet corresponds to a principal component analysis on correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see Table 1), but more complex methods can also be represented by their duality diagram.

Functions   Analyses                        Notes
dudi.pca    principal component             1
dudi.coa    correspondence                  2
dudi.acm    multiple correspondence         3
dudi.fca    fuzzy correspondence            4
dudi.mix    analysis of a mixture of
            numeric and factors             5
dudi.nsc    non symmetric correspondence    6
dudi.dec    decentered correspondence       7

Table 1: The dudi functions. 1: principal component analysis, same as prcomp/princomp. 2: correspondence analysis, Greenacre (1984). 3: multiple correspondence analysis, Tenenhaus and Young (1985). 4: fuzzy correspondence analysis, Chevenet et al. (1994). 5: analysis of a mixture of numeric variables and factors, Hill and Smith (1976), Kiers (1994). 6: non symmetric correspondence analysis, Kroonenberg and Lombardo (1999). 7: decentered correspondence analysis, Dolédec et al. (1995).
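As a quick numerical check of this correspondence (a sketch added here, not part of the original article; it anticipates the USArrests example used below), the eigenvalues of a normed PCA triplet coincide with those of the correlation matrix:

```r
# Sketch: for a normed PCA, the eigenvalues of the duality
# diagram equal the eigenvalues of the correlation matrix.
library(ade4)
data(USArrests)
pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
pca1$eig                      # triplet eigenvalues
eigen(cor(USArrests))$values  # should print the same values
```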

The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.

We can use, for example, a well-known dataset from the base package:

> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

scannf = FALSE means that the number of principal components that will be used to compute row and column coordinates should not be asked interactively to the user, but taken as the value of argument nf (by default, nf = 2). Other parameters allow one to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:


> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values
  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm

pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
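The variance property just described can be verified directly on the pca1 object (a small sketch using only quantities shown above):

```r
# The weighted variance of each column of row coordinates
# should equal the corresponding eigenvalue.
sapply(1:3, function(k) sum(pca1$lw * pca1$li[, k]^2))
pca1$eig[1:3]
# The normed scores should have weighted variance 1:
sum(pca1$lw * pca1$l1[, 1]^2)
```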

The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:

> score(pca1)
> s.corcircle(pca1$co)

Figure 1: One dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.

Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.

The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):

> scatter(pca1)


Figure 3: The PCA biplot. Variables are symbolized by arrows and they are superimposed to the individuals display. The scale of the graph is given by a grid, which size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.

Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:

> s.label(pca1$li)

Figure 4: Individuals factor map in a normed PCA.

Distance matrices

A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre (1986)), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971), cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
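A minimal sketch of this test-and-fix workflow on made-up dissimilarities (the data values are arbitrary; only the function names come from the package):

```r
library(ade4)
# Arbitrary symmetric dissimilarities, for illustration only
set.seed(1)
m <- matrix(runif(25, 1, 2), 5, 5)
d <- as.dist((m + t(m)) / 2)
is.euclid(d)          # test the Euclidean property
d2 <- quasieuclid(d)  # keep only the Euclidean part
is.euclid(d2)         # TRUE by construction
```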

The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows one to compare the geographical, genetic and anthropometric distances.

> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")

Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.


Taking into account groups of individuals

In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):

> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)

The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
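The table G of the between-class triplet is simply the table of group means. With uniform row weights it can be sketched in base R (X and f below are made up for illustration, not the meaudret data):

```r
# Row k of G is the mean of the rows of X in group k
# (uniform row weights assumed in this sketch).
X <- data.frame(sp1 = c(1, 3, 2, 8, 7, 9),
                sp2 = c(5, 4, 6, 0, 1, 2))
f <- factor(c("S1", "S1", "S1", "S2", "S2", "S2"))
G <- apply(X, 2, function(v) tapply(v, f, mean))
G  # S1 row: (2, 5); S2 row: (8, 1)
```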

Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).

Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)-, Dw), where (XtDX)- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^-1 Yt D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).

Permutation tests

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in the R (mantel.rtest) and in the C (mantel.randtest) programming language. The R


version allows one to see how computations are performed and to easily write other tests, while the C version is needed for performance reasons.
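As an illustration of what such a test computes (a generic sketch in base R, not the ade4 implementation), a Monte-Carlo test compares an observed statistic to its distribution under random permutations of the grouping factor:

```r
# Generic permutation test: p-value of an observed statistic
# against nrepet random permutations of the factor f.
perm.test <- function(x, f, stat, nrepet = 999) {
  obs <- stat(x, f)
  sim <- replicate(nrepet, stat(x, sample(f)))
  list(obs = obs,
       pvalue = (sum(sim >= obs) + 1) / (nrepet + 1))
}
# Example statistic: variance of the group means of x
bg.var <- function(x, f) var(tapply(x, f, mean))
set.seed(2)
perm.test(rnorm(24), gl(6, 4), bg.var)
```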

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")

Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.

Conclusion

We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non symmetric correspondence analysis (dudi.nsc), decentered correspondence analysis (dudi.dec).

We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).

The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy, Lavit et al. (1994) (statis and pta functions), or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305-310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39-54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295-309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du haut-rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29-40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078-3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139-156. NATO advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325-338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5-48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249-255.


Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197-212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367-396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The act (statis method). Computational Statistics and Data Analysis, 18:97-119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1-24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195-203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393-3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519-524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91-119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). Ade-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75-83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272-283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (reptilia: Phelsuma). Journal of Zoology, 201:13-26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404-423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421-432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows the user to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Given the intended final users, a free statistical environment was important, and R was a natural candidate.

Although the package was written from scratch, its initial design reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. It can be created by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample, or "rationale group", must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200   3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40  5
> diameter
      [,1]   [,2]   [,3]   [,4]   [,5]
1   74.030 74.002 74.019 73.992 74.008
2   73.995 73.992 74.001 74.011 74.004
...
40  74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)

Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 73.99   74.00   74.00   74.00   74.00   74.01

Group sample size:  5
Number of groups:  25
Center of group statistics:  74.00118
Standard deviation:  0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


type        Control charts for variables

"xbar"      X chart. Sample means are plotted to control the mean level of a continuous process variable.

"xbar.one"  X chart. Sample values from a one-at-a-time data process are plotted to control the mean level of a continuous process variable.

"R"         R chart. Sample ranges are plotted to control the variability of a continuous process variable.

"S"         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type        Control charts for attributes

"p"         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

"np"        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.

"c"         c chart. The number of defects per unit is plotted. This chart assumes that defects of the quality attribute are rare; the control limits are computed based on the Poisson distribution.

"u"         u chart. The average number of defects per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or by specifying the confidence level (so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tailed probability of 0.0027 under a standard normal curve.
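This equivalence can be checked directly in base R (a quick sketch using only pnorm and qnorm; no qcc functions involved):

```r
# Two-tailed probability beyond +/- 3 sigma under the standard normal
two.tail <- 2 * pnorm(-3)
round(two.tail, 4)            # approximately 0.0027

# Mapping the corresponding confidence level back to a sigma multiple
conf <- 1 - two.tail
qnorm(1 - (1 - conf)/2)       # approximately 3
```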

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000); Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
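As a cross-check, the default ±3σ limits printed in the summary above can be reproduced in base R (a sketch using the rounded center and standard deviation reported by qcc; for an X chart the limits are center ± nsigmas · σ/√n):

```r
center  <- 74.00118      # rounded values taken from the summary output
std.dev <- 0.009887547
n       <- 5             # group sample size
nsigmas <- 3

lcl <- center - nsigmas * std.dev/sqrt(n)
ucl <- center + nsigmas * std.dev/sqrt(n)
round(c(LCL = lcl, UCL = ucl), 5)   # close to 73.98791 and 74.01444
```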

[Figure 1: An X chart for the piston rings samples (diameter[1:25,]). Number of groups = 25, Center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, Number beyond limits = 0, Number violating runs = 0.]

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can however be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

[Figure 2: An X chart with calibration data in diameter[1:25,] and new data in diameter[26:40,]. Number of groups = 40, Center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, Number beyond limits = 3, Number violating runs = 1.]

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
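The moving-range estimator mentioned here is easy to sketch in base R (the data values below are hypothetical individual measurements; for k = 2 the average moving range is divided by the unbiasing constant d2(2) ≈ 1.128):

```r
x <- c(74.030, 74.002, 74.019, 73.992, 74.008)  # one-at-a-time data
mr <- abs(diff(x))             # moving ranges of k = 2 points
sigma.hat <- mean(mr) / 1.128  # d2 constant for samples of size 2
sigma.hat                      # roughly 0.0195
```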

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.

[Figure 3: Operating characteristic curves for the xbar chart (probability of type II error against process shift in std.dev units, for sample sizes n = 1, 5, 10, 15, 20) and for the p chart.]

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful for detecting small and permanent variation in the mean of the process.

The basic cusum chart implemented in the qcc package is at the moment available only for continuous variables. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control state; by default it is set to 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set to 1.
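The statistic behind the chart is a cumulative sum of standardized deviations from the target. A minimal base-R sketch of the upper one-sided cusum recursion follows (hypothetical data; this illustrates the idea, not qcc's exact implementation):

```r
cusum.pos <- function(x, target, se, k = 0.5) {
  s <- numeric(length(x)); prev <- 0
  for (t in seq_along(x)) {
    # accumulate standardized deviations, resetting at zero
    s[t] <- max(0, prev + (x[t] - target)/se - k)
    prev <- s[t]
  }
  s
}

cusum.pos(c(10, 11, 12), target = 10, se = 1)  # 0.0 0.5 2.0
```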

[Figure 4: Cusum chart for the calibration data in diameter[1:25,] and the new data in diameter[26:40,]. Number of groups = 40, Target = 74.00118, StdDev = 0.009887547, decision boundaries (std. err.) = 5, shift detection (std. err.) = 1, number of points beyond boundaries = 4.]

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set to 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
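The EWMA statistic itself is the simple recursion z_t = λ·x_t + (1 − λ)·z_{t−1}, sketched below in base R (hypothetical data and starting value; qcc's ewma additionally computes control limits):

```r
ewma.stat <- function(x, lambda = 0.2, start = x[1]) {
  z <- numeric(length(x)); prev <- start
  for (t in seq_along(x)) {
    z[t] <- lambda * x[t] + (1 - lambda) * prev  # exponential smoothing
    prev <- z[t]
  }
  z
}

ewma.stat(c(10, 12, 11), lambda = 0.2, start = 10)  # 10.00 10.40 10.52
```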


[Figure 5: EWMA chart for the calibration data in diameter[1:25,] and the new data in diameter[26:40,]. Number of groups = 40, Target = 74.00118, StdDev = 0.009887547, smoothing parameter = 0.2, control limits at 3σ.]

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type chart, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj, spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
       Center = 74.00118        LSL = 73.95
       StdDev = 0.009887547     USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set to 0.95).

[Figure 6: Process capability analysis for diameter[1:25,]. Number of obs = 125, Center = 74.00118, StdDev = 0.009887547, Target = 74, LSL = 73.95, USL = 74.05; Cp = 1.69, Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67; Exp<LSL 0%, Exp>USL 0%, Obs<LSL 0%, Obs>USL 0%.]

A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q–Q plot

• a capability plot

As an example we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95, 74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful for identifying those factors that have the greatest cumulative effect on the system, and thus for screening out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simplest form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axis labels and limits, the main title and bar colors (see help(pareto.chart)).

[Figure 7: Pareto chart for defect. Bars, ordered by frequency: contact num., price code, supplier code, part num., schedule date; frequency on the left axis, cumulative percentage on the right axis.]

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

[Figure 8: Cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).]

Tuning and extending the package

Options controlling some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user's point of view, the most interesting options allow setting: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before a point is signaled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
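For instance, a sketch of typical usage (assumes the qcc package is loaded; the option name run.length is inferred from the description above and should be checked against help(qcc.options)):

```r
library(qcc)

qcc.options()                  # list all options and current values
qcc.options("run.length")      # query a single option
qcc.options(run.length = 7)    # signal a violating run after 7 points
```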

Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) {
    lcl <- -conf
    ucl <- +conf
  }
  else {
    if (conf > 0 & conf < 1) {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid conf argument")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", sizes=n)
# plot the standardized control chart
> qcc(x, type="p.std", sizes=n)

[Figure 9: A p chart with non-constant control limits (left panel; Number of groups = 15, Center = 0.1931429, StdDev = 0.3947641, variable LCL and UCL, Number beyond limits = 0, Number violating runs = 0) and the corresponding standardized control chart (right panel; Center = 0, StdDev = 1, LCL = −3, UCL = 3).]

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed, (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
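The first two principles can be illustrated with a small simulated example (the names X and y below are made up for the illustration, not the article's data):

```r
set.seed(1)
X <- matrix(rnorm(200), 20, 10)   # a 20 x 10 model matrix
y <- rnorm(20)

b1 <- solve(t(X) %*% X) %*% t(X) %*% y      # naive textbook formula
b2 <- solve(crossprod(X), crossprod(X, y))  # preferred form

all.equal(c(b1), c(b2))                     # TRUE
```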

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+    mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
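The article does not list the definition of sys.gctime; one plausible minimal version, given the description, is the following (an assumed sketch; note that current versions of system.time already offer a gcFirst argument):

```r
# Assumed sketch: run gc() first so a pending garbage collection does
# not distort the timing, then time the expression as usual.
sys.gctime <- function(expr) {
  gc(verbose = FALSE)
  system.time(expr)
}
```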

According to the principles above, it should be more effective to compute

> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+    crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:

> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:

> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+    forwardsolve(ch, crossprod(mmm, y),
+       upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+    crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+    crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times:

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or ATLAS (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices – yet another reason to use crossprod.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of tcltk (and be configured for it). Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. If nothing is provided by a user, the default behavior is to open the first package in the library of locally installed R packages (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.

R News ISSN 1609-3631

Vol. 4/1, June 2004

Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time,
                     status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt,
               data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving",
       yscale = 100,
       col = c("forestgreen", "blue"))

> survdiff(Surv(time, status) ~ trt,
           data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (Standard vs. New; % surviving against years since randomisation).

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
         ratetable(edtrt = edtrt, bili = bili,
                   platelet = platelet, age = age,
                   protime = protime),
         data = pbc, subset = trt == -9,
         ratetable = mayomodel, cohort = TRUE),
       col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt against time).

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population, and for computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada
[email protected]

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt, and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
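The internal representations just described are easy to inspect directly: unclass() strips the class attribute and exposes the underlying numbers. A minimal base-R sketch (chron is omitted here since it is a contributed package):

```r
## Date: days since January 1, 1970
d <- as.Date("1970-01-02")
unclass(d)                 # 1

## POSIXct: seconds since January 1, 1970 GMT
p <- as.POSIXct("1970-01-01 00:01:00", tz = "GMT")
as.numeric(p)              # 60

## POSIXlt: a list of components (sec, min, hour, mday, ...)
pl <- as.POSIXlt(p)
unclass(pl)$min            # 1
```
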

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
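The Windows-origin recipe can be checked with a couple of serial numbers (the values 38353 and 100.25 below are made up for illustration):

```r
## Excel (Windows) serial day-times: days since 1899-12-30
x <- c(38353.0, 38353.5)            # both fall on the same day
as.Date("1899-12-30") + floor(x)    # "2005-01-01" "2005-01-01"

## Excel (Mac) serial dates: days since 1904-01-01
y <- 100.25
as.Date("1904-01-01") + floor(y)    # "1904-04-10"
```
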

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
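In base R the SAS conversions can also be done by supplying the origin explicitly; the numbers below are hypothetical SAS values, used only to illustrate the pattern:

```r
## SAS date-times are seconds since 1960-01-01
sas <- 86400                        # one full day after the SAS origin
as.POSIXct(sas, origin = "1960-01-01", tz = "GMT")

## SAS dates are days since 1960-01-01
as.Date(31, origin = "1960-01-01")  # "1960-02-01"
```
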

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
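As a small illustration of the ts labelling scheme, a regular monthly series stores its dates implicitly as a start point plus a frequency rather than as date objects:

```r
## two years of monthly data starting January 2000
z <- ts(rnorm(24), start = c(2000, 1), frequency = 12)
start(z)      # 2000  1
end(z)        # 2001 12
frequency(z)  # 12
```
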

Input

By default, read.table will read in numeric data such as 20040115 as numbers, and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.[1]

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE, then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x

Regarding the POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

  dp <- seq(Sys.time(), len = 10, by = "day")
  plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison table. z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))  # GMT days

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3)%%7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct: format(dp, "%a %b %d, %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc));  as.Date(format(dp))
  to chron:   chron(unclass(dt));
              chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt));
              as.POSIXct(paste(as.Date(dates(dc)), times(dc)%%1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc));  as.Date(dp)
  to chron:   chron(unclass(dt));  as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT");  as.POSIXct(dc)

Notes:
[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if one hour deviation at time changes between standard and daylight savings time are acceptable.
[3] Can be reduced to dp-24*60*60 if one hour deviation at time changes between standard and daylight savings time are acceptable.
[4] Can be reduced to dp+24*60*60*x if one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation ofthe function that increases the speed substantiallythough at the cost of requiring T to be a numberrather than just an object for which gt and lt= are de-fined

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
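The claim is easy to check on simulated data (this snippet is mine, not the article's; the variable names mirror the function above):

```r
## Sketch: verify on simulated data that the cumulative true and
## false positive rates computed as in drawROC are non-decreasing.
set.seed(1)
T <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # test values
D <- rep(c(1, 0), each = 50)                      # disease indicator
DD <- table(-T, D)
sens <- cumsum(DD[, 2])/sum(DD[, 2])   # true positive rate
mspec <- cumsum(DD[, 1])/sum(DD[, 1])  # false positive rate
all(diff(sens) >= 0) && all(diff(mspec) >= 0)  # monotone, as claimed
```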

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
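The naming convention is easy to demonstrate in isolation; the generic and class below are invented purely for illustration:

```r
## Hypothetical S3 dispatch illustration: quack() is a generic, and
## quack.duck() is found purely by the method-naming convention.
quack <- function(x) UseMethod("quack")
quack.duck <- function(x) "Quack!"

d <- structure(list(), class = "duck")
quack(d)  # dispatches to quack.duck
```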

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method.

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve.

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
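One possible sketch (mine, not the article's) applies the trapezoidal rule to the rates stored in the S3 ROC object, after prepending the (0, 0) corner of the curve:

```r
## Possible AUC sketch (not from the article): trapezoidal rule over
## the (mspec, sens) pairs stored in an S3 ROC object.
AUC <- function(x) {
  mspec <- c(0, x$mspec)  # prepend the (0, 0) corner of the curve
  sens <- c(0, x$sens)
  n <- length(sens)
  ## interval widths times average curve height over each interval
  sum(diff(mspec) * (sens[-1] + sens[-n])/2)
}
```

A perfect test (sensitivity 1 everywhere) gives area 1, and a curve along the diagonal gives 0.5.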

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster, in the Design package, is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
         representation(sens = "numeric", mspec = "numeric",
                        test = "numeric", call = "call"),
         validity = function(object) {
           length(object@sens) == length(object@mspec) &&
             length(object@sens) == length(object@test)
         })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code.

setMethod("show", "ROC",
          function(object) {
            cat("ROC curve: ")
            print(object@call)
          })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
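A minimal standalone illustration of slot access (the "point" class here is invented for this example, not part of the article):

```r
## Minimal S4 slot-access illustration; the "point" class is
## invented purely for this example.
library(methods)
setClass("point", representation(x = "numeric", y = "numeric"))
p <- new("point", x = 1, y = 2)
p@x  # slot access with @, analogous to $ for list components
```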

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
          function(x, y, type = "b", null.line = TRUE,
                   xlab = "1-Specificity",
                   ylab = "Sensitivity",
                   main = NULL, ...) {
            par(pty = "s")
            plot(x@mspec, x@sens, type = type,
                 xlab = xlab, ylab = ylab, ...)
            if (null.line)
              abline(0, 1, lty = 3)
            if (is.null(main))
              main <- x@call
            title(main = main)
          })

Creating a method for lines appears to work the same way.

setMethod("lines", "ROC",
          function(x, ...) {
            lines(x@mspec, x@sens, ...)
          })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
          function(T, ...) {
            if (!(T$family$family %in%
                  c("binomial", "quasibinomial")))
              stop("ROC curves for binomial glms only")
            test <- fitted(T)
            disease <- (test + resid(T, type = "response"))
            disease <- disease * weights(T)
            if (max(abs(disease %% 1)) > 0.01)
              warning("Y values suspiciously far from integers")
            TT <- rev(sort(unique(test)))
            DD <- table(-test, disease)
            sens <- cumsum(DD[, 2])/sum(DD[, 2])
            mspec <- cumsum(DD[, 1])/sum(DD[, 1])
            new("ROC", sens = sens, mspec = mspec,
                test = TT, call = sys.call())
          })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
          function(x, labels = NULL, ..., digits = 1) {
            if (is.null(labels))
              labels <- round(x@test, digits)
            identify(x@mspec, x@sens,
                     labels = labels, ...)
          })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits = FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/popviewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm, using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


for their continued drive to create and support this wonderful vehicle for international collaboration.

Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
MSchwartz@MedAnalytics.com

The ade4 package - I: One-table methods
by Daniel Chessel, Anne B. Dufour and Jean Thioulouse

Introduction

This paper is a short summary of the main classes defined in the ade4 package for one table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-tables coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-ways tables), and for graphical methods.

This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number, but means that there are four E in the acronym).

The ade4 package is available in CRAN, but it can also be used directly online, thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).

The duality diagram class

The basic tool in ade4 is the duality diagram, Escoufier (1987). A duality diagram is simply a list that contains a triplet (X, Q, D):

- X is a table with n rows and p columns, considered as p points in R^n (column vectors) or n points in R^p (row vectors).

- Q is a p x p diagonal matrix containing the weights of the p columns of X, and used as a scalar product in R^p (Q is stored under the form of a vector of length p).

- D is an n x n diagonal matrix containing the weights of the n rows of X, and used as a scalar product in R^n (D is stored under the form of a vector of length n).

For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix Ip, and if D is equal to (1/n) In, the triplet corresponds to a principal component analysis on correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see table 1), but more complex methods can also be represented by their duality diagram.
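As a quick numerical check of this correspondence (a sketch, assuming the ade4 package is installed, and using the USArrests data analysed below), the eigenvalues produced by the normed-PCA triplet are exactly the eigenvalues of the correlation matrix:

```r
# Sketch: for the triplet (X, Ip, (1/n) In) with X centered and normed,
# the eigenvalues of the analysis are those of the correlation matrix.
library(ade4)
data(USArrests)
pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
pca1$eig                     # eigenvalues from the duality diagram
eigen(cor(USArrests))$values # same values, from the correlation matrix
```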

Table 1: The dudi functions.

  Functions   Analyses                       Notes
  dudi.pca    principal component            1
  dudi.coa    correspondence                 2
  dudi.acm    multiple correspondence        3
  dudi.fca    fuzzy correspondence           4
  dudi.mix    analysis of a mixture of
              numeric and factors            5
  dudi.nsc    non symmetric correspondence   6
  dudi.dec    decentered correspondence      7

Notes: 1: principal component analysis, same as prcomp/princomp. 2: correspondence analysis, Greenacre (1984). 3: multiple correspondence analysis, Tenenhaus and Young (1985). 4: fuzzy correspondence analysis, Chevenet et al. (1994). 5: analysis of a mixture of numeric variables and factors, Hill and Smith (1976), Kiers (1994). 6: non symmetric correspondence analysis, Kroonenberg and Lombardo (1999). 7: decentered correspondence analysis, Dolédec et al. (1995).

The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.

We can use, for example, a well-known dataset from the base package:

> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

scannf = FALSE means that the number of principal components that will be used to compute row and column coordinates should not be asked interactively to the user, but taken as the value of argument nf (by default, nf = 2). Other parameters allow to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:

R News ISSN 1609-3631

Vol 41 June 2004 6

> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values
  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm

pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
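As a sketch (assuming ade4 is installed), these relations between coordinates and eigenvalues can be verified numerically, using the row weights in pca1$lw; the coordinates are centered, so the weighted variance reduces to a weighted sum of squares:

```r
# Sketch: the D-weighted variance of each column of pca1$li equals the
# corresponding eigenvalue, and the columns of pca1$l1 have unit variance.
library(ade4)
data(USArrests)
pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
sapply(pca1$li, function(z) sum(pca1$lw * z^2))  # the first three eigenvalues
sapply(pca1$l1, function(z) sum(pca1$lw * z^2))  # all equal to 1
```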

The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:

> score(pca1)
> s.corcircle(pca1$co)


Figure 1: One dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.


Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.

The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):

gt scatter(pca1)

R News ISSN 1609-3631

Vol 41 June 2004 7


Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed to the individuals display. The scale of the graph is given by a grid, which size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.

Separate factor maps can be drawn with the s.corcircle (see figure 2) and s.label functions:

> s.label(pca1$li)


Figure 4: Individuals factor map in a normed PCA.

Distance matrices

A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean, Gower and Legendre (1986), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971), cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
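A minimal sketch of this workflow (assuming ade4 is installed), using one of the Yanomama matrices analysed below:

```r
# Sketch: test the Euclidean property of a distance matrix and, if needed,
# make it Euclidean with quasieuclid before a principal coordinates analysis.
library(ade4)
data(yanomama)
d <- as.dist(yanomama$gen)
is.euclid(d)          # test the Euclidean property of the raw distances
d2 <- quasieuclid(d)  # Euclidean approximation of d
is.euclid(d2)         # TRUE by construction
```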

The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows to compare the geographical, genetic and anthropometric distances.

> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")


Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.

R News ISSN 1609-3631

Vol 41 June 2004 8

Taking into account groups of individuals

In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):

> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+                 scannf = FALSE)
> plot(bet1)

The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
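To make the construction of the between-class triplet concrete, here is a sketch (assuming ade4 is installed) that rebuilds a table of per-class means from the coa1 diagram, using the CA row weights; this follows the definition given above and is meant as an illustration, not as a reproduction of between()'s internals:

```r
# Sketch: G is the table of row-weighted means of coa1$tab for the groups
# defined by the factor; the group weights are the sums of the row weights.
library(ade4)
data(meaudret)
coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
fac <- meaudret$plan$sta
wt <- coa1$lw
G <- apply(coa1$tab, 2, function(col)
  tapply(wt * col, fac, sum) / tapply(wt, fac, sum))
dim(G)  # one row per sampling site, one column per species
```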


Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).

Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)^-, Dw), where (XtDX)^- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^-1 Yt DX, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).

Permutation tests

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest). The R version allows one to see how the computations are performed, and to easily write other tests, while the C version is needed for performance reasons.

R News ISSN 1609-3631
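The principle behind all these tests can be sketched in a few lines of base R. The following toy function uses a plain difference of group means as the test statistic (not ade4's between-class inertia criterion) to show how a simulated p-value is obtained:

```r
# Minimal Monte-Carlo test: permute the class labels, recompute the
# statistic, and compare the observed value to the simulated distribution
perm_test <- function(x, f, nrepet = 999) {
  obs <- abs(diff(tapply(x, f, mean)))
  sim <- replicate(nrepet, abs(diff(tapply(x, sample(f), mean))))
  # the observed value counts as one of the permutations
  (sum(sim >= obs) + 1) / (nrepet + 1)
}
set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 2))   # two separated groups
f <- factor(rep(c("A", "B"), each = 20))
perm_test(x, f)   # small p-value: the group structure is significant
```

The real tests differ in the statistic they permute, but the logic (permute labels, recompute, count) is the same.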

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")

Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line at the right of the histogram.

Conclusion

We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example: multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).

We are preparing a second paper dealing with two-table coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).

The third category of data analysis methods available in ade4 are K-tables analysis methods, which try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al. (1994)) (statis and pta functions) or from the multiple co-inertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du Haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Developments in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.


Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). Ade-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows the user to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, although written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200   3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40  5
> diameter
      [,1]   [,2]   [,3]   [,4]   [,5]
1   74.030 74.002 74.019 73.992 74.008
2   73.995 73.992 74.001 74.011 74.004
...
40  74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1), and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
     LCL      UCL
73.98791 74.01444


type — Control charts for variables:

"xbar"      X chart. Sample means are plotted to control the mean level of a continuous process variable.

"xbar.one"  X chart. Sample values from a one-at-a-time data process are plotted to control the mean level of a continuous process variable.

"R"         R chart. Sample ranges are plotted to control the variability of a continuous process variable.

"S"         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type — Control charts for attributes:

"p"         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

"np"        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.

"c"         c chart. The number of defectives per unit is plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.

"u"         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or by specifying the confidence level (so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
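To illustrate how such limits arise, X-bar chart limits can be hand-computed from a matrix of rational groups. This is a simplified sketch: the pooled within-group variance used here is not qcc's default estimator, which applies bias-correction constants (see Montgomery (2000)):

```r
# Sketch: hand-computed X-bar chart limits from a groups-in-rows matrix
xbar_limits <- function(x, conf = 0.9973) {
  nsigmas <- qnorm(1 - (1 - conf) / 2)   # "British" conf -> ~3 sigmas
  center  <- mean(rowMeans(x))
  sigma   <- sqrt(mean(apply(x, 1, var)))  # pooled within-group estimate
  se      <- sigma / sqrt(ncol(x))
  c(LCL = center - nsigmas * se, UCL = center + nsigmas * se)
}
set.seed(123)
x <- matrix(rnorm(25 * 5, mean = 74, sd = 0.01), nrow = 25)  # simulated data
xbar_limits(x)   # limits close to 74 +/- 3 * 0.01 / sqrt(5)
```

Note how the default confidence level 0.9973 reproduces the ±3σ American practice, since qnorm(1 - 0.0027/2) is approximately 3.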

Figure 1: An X chart for the samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can however be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

Figure 2: An X chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.
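Under normality the type II error of an X chart with symmetric nsigmas limits has a closed form (the standard result, as in Montgomery (2000)); a short sketch:

```r
# beta = P(group statistic stays within the limits | the mean shifted by
# delta within-group standard deviations), for samples of size n
beta_xbar <- function(delta, n, nsigmas = 3) {
  pnorm(nsigmas - delta * sqrt(n)) - pnorm(-nsigmas - delta * sqrt(n))
}
beta_xbar(1, 5)   # ~0.78: a 1-sd shift is missed about 78% of the time
```

Evaluating this over a grid of shifts reproduces the shape of the left panel of Figure 3.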

Figure 3: Operating characteristic curves, for the xbar chart (left, for sample sizes n = 1, 5, 10, 15, 20) and for the p chart (right).

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful for detecting small and permanent variations in the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out-of-control signal; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
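The recursions behind a tabular cusum can be sketched in a few lines of base R. This is a simplified illustration rather than the qcc implementation: z holds standardized group statistics, and the reference value k is half the shift to be detected (i.e. se.shift/2):

```r
# One-sided cumulative sums of deviations above and below the target
cusum_sums <- function(z, k = 0.5) {
  hi <- lo <- numeric(length(z))
  for (i in seq_along(z)) {
    hi[i] <- max(0,  z[i] - k + if (i > 1) hi[i - 1] else 0)  # above target
    lo[i] <- max(0, -z[i] - k + if (i > 1) lo[i - 1] else 0)  # below target
  }
  cbind(above = hi, below = lo)
}
set.seed(1)
z <- c(rnorm(15), rnorm(10, mean = 1))   # mean shift after sample 15
cusum_sums(z)   # "above" starts accumulating once the shift occurs
```

An out-of-control signal is raised when either sum exceeds the decision interval (decision.int in qcc).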

Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variations in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
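The underlying smoothing is just the recursion z_t = λ x_t + (1 − λ) z_{t−1}. A minimal base-R sketch (not the qcc implementation, which also computes time-varying control limits for the smoothed statistics):

```r
# EWMA recursion; qcc starts the recursion at the target value
ewma_stat <- function(x, lambda = 0.2, start = x[1]) {
  z <- numeric(length(x))
  prev <- start
  for (i in seq_along(x)) {
    z[i] <- lambda * x[i] + (1 - lambda) * prev   # exponential smoothing
    prev <- z[i]
  }
  z
}
ewma_stat(c(1, 1, 1, 1), lambda = 0.5, start = 0)   # 0.5 0.75 0.875 0.9375
```

Small λ gives long memory (better for small shifts); λ = 1 reproduces the raw Shewhart statistics.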

R News ISSN 1609-3631

Vol 41 June 2004 15

Figure 5: EWMA chart.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc', for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
       Center = 74.00118        LSL = 73.95
       StdDev = 0.009887547     USL = 74.05

Capability indices:

       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
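The two basic indices can be checked by hand from the quantities printed above (definitions as in Montgomery (2000)):

```r
# Cp compares the specification width to the process spread (6 sigma);
# Cp_k additionally accounts for how the process is centered
Cp  <- function(LSL, USL, sigma) (USL - LSL) / (6 * sigma)
Cpk <- function(LSL, USL, mu, sigma) min(USL - mu, mu - LSL) / (3 * sigma)

round(Cp(73.95, 74.05, 0.009887547), 3)             # 1.686, as printed
round(Cpk(73.95, 74.05, 74.00118, 0.009887547), 3)  # 1.646
```

The gap between Cp and Cp_k here is small because the process center (74.00118) sits almost exactly at the target.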

Figure 6: Process capability analysis for diameter[1:25, ].

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q–Q plot

• a capability plot

As an example, we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered by frequency, in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful for identifying those factors that have the greatest cumulative effect on the system, and thus for screening out the less significant factors in an analysis.
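The computation behind the chart is just sorting and accumulating; a base-R sketch with made-up counts:

```r
# Sort categories by frequency and compute cumulative percentages
counts <- c(A = 80, B = 27, C = 66, D = 94, E = 33)   # hypothetical counts
sorted <- sort(counts, decreasing = TRUE)
round(100 * cumsum(sorted) / sum(sorted), 1)
# the top two categories (D and A) already account for 58% of the total
```

The bars of the Pareto chart display `sorted`, and the overlaid line displays these cumulative percentages.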

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).

Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes", "Inspectors"),
+              Materials=c("Alloys", "Suppliers"),
+              Personnel=c("Supervisors", "Operators"),
+              Environment=c("Condensation", "Moisture"),
+              Methods=c("Brake", "Engager", "Angle"),
+              Machines=c("Speed", "Bits", "Sockets")),
+   effect="Surface Flaws")

Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits, and thus it may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case); another one which returns the within-group standard deviation (one in this case); and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000). Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991). Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    \hat{\beta} = \arg\min_{\beta} \| y - X\beta \|^2    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    \hat{\beta} = (X'X)^{-1} X'y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X'X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X'y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X'X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
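These rules can be verified on a small simulated problem (a toy sketch; the timings reported in this article make the same point on a real, ill-conditioned example):

```r
set.seed(42)
n <- 1000; p <- 20
X <- matrix(rnorm(n * p), n, p)          # simulated full-rank model matrix
y <- X %*% rnorm(p) + rnorm(n)

naive  <- solve(t(X) %*% X) %*% t(X) %*% y       # the textbook formula
better <- solve(crossprod(X), crossprod(X, y))   # rules 1 and 2
ch <- chol(crossprod(X))                         # rule 3: Cholesky factor
chol_sol <- backsolve(ch, forwardsolve(ch, crossprod(X, y),
                                       upper = TRUE, trans = TRUE))

all.equal(c(naive), c(better))      # TRUE
all.equal(c(better), c(chol_sol))   # TRUE
```

All three routes agree to numerical precision on a well-conditioned problem; they differ in speed and, on ill-conditioned problems, in accuracy.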

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sysgc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers but, in most cases, the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward

> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
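The equivalence is easy to check (a sketch, not from the article, using base R on small random matrices):

```r
set.seed(42)
A <- matrix(rnorm(300 * 4), 300, 4)
B <- matrix(rnorm(300 * 3), 300, 3)

# two-argument form: A'B without ever forming t(A)
stopifnot(all.equal(crossprod(A, B), t(A) %*% B))

# one-argument form: A'A, exploiting symmetry of the result
stopifnot(all.equal(crossprod(A), t(A) %*% A))
```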

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. Journal of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, which is discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the results of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References


Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
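As a minimal sketch (not from the article, using made-up times), a Surv object can be built in either form:

```r
library(survival)

# Two subjects: the first dies at time 5, the second is
# censored (still under observation) at time 10.
s <- Surv(time = c(5, 10), event = c(1, 0))
s   # the + printed after 10 marks censoring

# With a nonzero start time, supply (start, stop, event):
s2 <- Surv(c(0, 2), c(5, 10), c(1, 0))
```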

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+       diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time,status)~trt,data=veteran),
+    xlab="Years since randomisation",
+    xscale=365, ylab="% surviving", yscale=100,
+    col=c("forestgreen","blue"))
> survdiff(Surv(time,status)~trt, data=veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments. (Axes: "Years since randomisation" vs. "% surviving"; curves labelled Standard and New.)

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time,status)~edtrt+
+        log(bili)+log(protime)+age+platelet,
+        data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) +
    age + platelet, data = pbc,
    subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185 on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time,status)~edtrt,
+        data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
+        ratetable(edtrt=edtrt, bili=bili,
+          platelet=platelet, age=age,
+          protime=protime),
+        data=pbc, subset=trt==-9,
+        ratetable=mayomodel, cohort=TRUE),
+      col="purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival. (Axes: survival probability, 0.0 to 1.0, vs. time in days, 0 to 4000.)

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards. (Axes: Beta(t) for edtrt vs. time.)

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
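A minimal sketch of a survreg fit (not from the article; it uses the veteran data shipped with survival and an assumed Weibull error distribution):

```r
library(survival)
data(veteran)

# Parametric (accelerated failure time) model for log survival
# time, with a Weibull error distribution
fit <- survreg(Surv(time, status) ~ trt + age,
               data = veteran, dist = "weibull")
summary(fit)
```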

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004: The R User Conference

by John Fox, McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk

Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt, and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications such as world currency markets where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
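The internal representations described above are easy to verify (a sketch using only base R; chron is omitted here since it may not be installed):

```r
# Date: days since 1970-01-01
d <- as.Date("1970-01-02")
stopifnot(as.numeric(d) == 1)

# POSIXct: seconds since 1970-01-01 00:00:00 GMT
p <- as.POSIXct("1970-01-01 00:01:00", tz = "GMT")
stopifnot(as.numeric(p) == 60)

# moving between the classes
stopifnot(as.Date(p) == as.Date("1970-01-01"))
```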

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
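For example, for the Windows origin (a sketch with made-up serial numbers):

```r
# Windows Excel serial numbers: days since 1899-12-30
x <- c(1, 36526)
as.Date("1899-12-30") + floor(x)
# [1] "1899-12-31" "2000-01-01"
```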

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or timedate package to represent datetimes.
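The ts dating scheme can be sketched as follows (not from the article; base R only):

```r
# a monthly series starting January 2004; the "dates" are
# implied by start= and frequency=, not stored as a date class
z <- ts(1:12, start = c(2004, 1), frequency = 12)
frequency(z)    # 12
time(z)[2]      # 2004.083, i.e., February 2004
```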

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60

# chron dates with default origin
orig + x

Regarding the POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone and Greenwich Mean Time (GMT) can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc));  as.Date(format(dp))
  to chron:   chron(unclass(dt));  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt));  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc));  as.Date(dp)
  to chron:   chron(unclass(dt));  as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT");  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison table


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
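The monotonicity is easy to verify numerically. The following sketch (the simulated data and seed are invented for illustration, not from the article) recomputes the cumulative rates used by the table-based drawROC and checks that both are non-decreasing:

```r
# Simulated diagnostic test: 50 controls (D=0) and 50 cases (D=1);
# cases tend to score higher on the test T.
set.seed(1)
D <- rep(c(0, 1), each = 50)
T <- rnorm(100, mean = 2 * D)

DD <- table(-T, D)                       # rows ordered by decreasing T
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])  # true positive rate
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])  # false positive rate

# Cumulative sums of non-negative counts are non-decreasing.
stopifnot(!is.unsorted(sens), !is.unsorted(mspec))
```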

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
    call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
    xlab="1-Specificity", ylab="Sensitivity",
    main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
    xlab=xlab, ylab=ylab, ...)


  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ...,
    digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
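One possible solution sketch, not from the article (which leaves the exercise to the reader): since sens and mspec are the curve's y and x coordinates, the trapezoidal rule gives the area. The AUC name and the assumption that mspec is non-decreasing are mine:

```r
# Hypothetical AUC helper for an object with the sens/mspec components
# built above; integrates sensitivity over 1-specificity by trapezoids.
AUC <- function(x) {
  sum(diff(x$mspec) * (head(x$sens, -1) + tail(x$sens, -1)) / 2)
}

# A perfect classifier traces the left and top edges of the unit square;
# an uninformative one follows the diagonal.
perfect  <- list(sens = c(0, 1, 1), mspec = c(0, 0, 1))
coinflip <- list(sens = c(0, 1), mspec = c(0, 1))
```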

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
    test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
      c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
    test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
    test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new


function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
    test=TT, call=sys.call())
}
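The validity check can be exercised directly. This sketch repeats the setClass() call so it runs on its own; the mismatched slot lengths are invented for illustration:

```r
library(methods)

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
    test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

# Equal-length slots pass the validity check...
ok <- new("ROC", sens=c(0, 1), mspec=c(0, 1),
  test=c(0, 1), call=quote(ROC()))

# ...while mismatched lengths are rejected by new().
bad <- try(new("ROC", sens=c(0, 1), mspec=0,
  test=c(0, 1), call=quote(ROC())), silent=TRUE)
```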

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The notation @ should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
      xlab="1-Specificity",
      ylab="Sensitivity",
      main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
      xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call:

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")


    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
      digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
      labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has


been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3 of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
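A minimal sketch of the new environment semantics (variable names invented):

```r
e <- new.env()
e$x <- 1          # same effect as assign("x", 1, envir = e)
e[["y"]] <- 2

# Retrieval, with inherits=FALSE semantics:
e$x
e[["y"]]
```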

• There are now print() and '[' methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments, for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
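A small sketch of the new class (the dates are invented for illustration):

```r
d <- as.Date("2004-06-01")
d + 30              # day arithmetic: 2004-07-01
format(d, "%Y-%m")  # formatting uses the strptime-style % codes
```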

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
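A quick sketch of the two definitions (example values invented):

```r
factorial(5)     # 120, i.e. gamma(6)
lfactorial(10)   # log(10!), computed as lgamma(11)
```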

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
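A short sketch (the names and values are invented):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)

# mget() returns a named list with one element per requested name.
vals <- mget(c("a", "b"), envir = e)
```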

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
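For example (the filename is invented), a path containing a space can be quoted for a Bourne-style shell, the default type:

```r
f <- "my file.txt"
shQuote(f)                       # "'my file.txt'"
cmd <- paste("ls -l", shQuote(f))  # safe to pass to system()
```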

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
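A sketch of the fixed argument (strings invented): with fixed = TRUE the split string is taken literally rather than as a regular expression:

```r
strsplit("a.b.c", ".")[[1]]                # "." as a regexp matches any character
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # literal "." gives c("a", "b", "c")
```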

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added a just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A",
      gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"),
      gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list, and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

ndash Changed the implementation of grob ob-jects They are no longer implementedas external references They are nowregular R objects which copy-by-valueThis means that they can be savedloadedlike normal R objects In order to retainsome existing grob behaviour the follow-ing changes were necessary

grobs all now have a name slot Thegrob name is used to uniquely iden-tify a drawn grob (ie a grob on thedisplay list)

gridedit() and gridpack() nowtake a grob name as the firs argumentinstead of a grob (Actually they takea gPath see below)

the grobwidth and grobheightunits take either a grob OR a grobname (actually a gPath see below)Only in the latter case will the unitbe updated if the grob pointed to ismodified

In addition the following features arenow possible with grobs

grobs now save()load() like anynormal R object

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
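A minimal sketch of the name-based workflow (the grob name "myrect" is made up for illustration):

```r
library(grid)

# Draw a named rect, then edit and retrieve it by name instead of
# holding a reference to the drawn object.
grid.newpage()
grid.rect(name = "myrect")
grid.edit("myrect", gp = gpar(fill = "grey"))  # modify the drawn grob
r <- grid.get("myrect")                        # extract a copy-by-value grob
```

Because grobs are now ordinary R objects, the value returned by grid.get() can be save()d and load()ed like any other value.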

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport "clip" argument are now:

  "on"      = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off"     = turn off clipping

  Still accept logical values (and NA maps to "off").
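A short sketch of the new "off" setting; with clipping disabled, drawing can extend beyond the viewport edges:

```r
library(grid)

# A small viewport with clipping turned off: the oversized rectangle
# is drawn beyond the viewport boundary rather than being clipped.
grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5, clip = "off"))
grid.rect(width = 1.5, height = 1.5, gp = gpar(col = "grey"))
popViewport()
```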

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
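The effect of fixing the seed is easy to see in a sketch: reseeding reproduces the same stream, which is what makes checked example output comparable across runs:

```r
# Reseeding with the same value reproduces the same random numbers,
# so example output is stable from one run to the next.
set.seed(1)
x <- runif(3)
set.seed(1)
y <- runif(3)
identical(x, y)  # TRUE
```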

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).
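The replacements can be checked numerically against the classical values psi''(1) = -2*zeta(3) and psi'''(1) = 6*zeta(4) (this check is an illustration, not from the NEWS file):

```r
# psigamma(x, deriv = n) is the nth derivative of digamma(x);
# tetragamma(x) corresponds to deriv = 2, pentagamma(x) to deriv = 3.
c(psigamma(1, deriv = 2),   # -2 * zeta(3), approx. -2.4041138
  psigamma(1, deriv = 3))   #  6 * zeta(4), approx.  6.4939394
```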

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward way to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)

$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734

  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values

  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm

pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
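The variance property of the coordinates can be verified with base R alone (a sketch that mirrors, but does not use, ade4's dudi.pca):

```r
# Normed PCA by hand on USArrests: centre and scale with the 1/n
# variance convention used by ade4, then diagonalize the correlation
# matrix. The mean square of each column of row coordinates equals
# the corresponding eigenvalue.
n  <- nrow(USArrests)
X  <- scale(as.matrix(USArrests)) * sqrt(n / (n - 1))
e  <- eigen(crossprod(X) / n)   # eigen-decomposition of the cor matrix
li <- X %*% e$vectors           # row coordinates (as in pca1$li)
colMeans(li^2)                  # reproduces the eigenvalues e$values
```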

The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:

> score(pca1)
> s.corcircle(pca1$co)


Figure 1: One dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.


Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.

The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):

> scatter(pca1)



Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed to the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.

Separate factor maps can be drawn with the s.corcircle (see figure 2) and s.label functions:

> s.label(pca1$li)


Figure 4: Individuals factor map in a normed PCA.

Distance matrices

A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre, 1986), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971), cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).

The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows comparison of the geographical, genetic and anthropometric distances.

> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")

Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.

R News ISSN 1609-3631

Vol. 4/1, June 2004

Taking into account groups of individuals

In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):

> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)

The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
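The table G of group means that drives a between-class analysis can be sketched in a few lines of base R (a toy illustration with uniform row weights; this is not ade4's code):

```r
# Table of group means G for a between-class analysis:
# each row of G is the mean profile of table X over one level of factor f
X <- data.frame(sp1 = c(1, 3, 2, 6), sp2 = c(0, 2, 4, 4))
f <- factor(c("S1", "S1", "S2", "S2"))

G <- rowsum(as.matrix(X), f) / as.vector(table(f))  # group means
G
#    sp1 sp2
# S1   2   1
# S2   4   4
```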

Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).

Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)⁻, Dw), where (XtDX)⁻ is a generalized inverse. This gives for example a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X − Y Dw⁻¹ YtDX, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).

Permutation tests

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest) programming language. The R


version allows one to see how computations are performed, and to write easily other tests, while the C version is needed for performance reasons.

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
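The general recipe of such a test is short in base R (a generic illustration with a correlation statistic, not ade4 code): compute the observed statistic, recompute it on permuted data, and count how often the permuted values reach the observed one.

```r
# Generic permutation (Monte-Carlo) test: compare an observed statistic
# to its distribution under random permutations of one of the inputs
set.seed(42)
x <- rnorm(30)
y <- x + rnorm(30)                        # correlated with x
obs <- cor(x, y)                          # observed statistic
sim <- replicate(999, cor(x, sample(y)))  # 999 permuted values
# one-sided p-value, counting the observed value among the replicates
p <- (1 + sum(sim >= obs)) / (1 + 999)
p   # small: the observed correlation is far in the permutation tail
```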

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")

Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.

Conclusion

We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).

We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).

The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy, Lavit et al. (1994) (statis and pta functions), or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée : application aux peuplements ichtyologiques du Haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.

Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Nonsymmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows one to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, although written from scratch, reflected the set of functions available in S-Plus. Users of this latter environment will find some similarities, but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample or "rationale group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200   3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40  5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size:  5
Number of groups:  25
Center of group statistics:  74.00118
Standard deviation:  0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


type      Control charts for variables

xbar      X chart. Sample means are plotted to control the mean level of a continuous process variable.
xbar.one  X chart. Sample values from a one-at-time data process to control the mean level of a continuous process variable.
R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type      Control charts for attributes

p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
c         c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σs from the center. This default can be changed using the argument nsigmas (so called "American practice") or specifying the confidence level (so called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σs correspond to a two-tails probability of 0.0027 under a standard normal curve.
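Both quantities are easy to verify in R:

```r
# Two-tails probability outside +/- 3 sigma under the standard normal
2 * pnorm(-3)
# [1] 0.002699796

# Conversely, a confidence level of 0.9973 corresponds to limits at
# (essentially) 3 standard deviations
qnorm(1 - (1 - 0.9973) / 2)   # close to 3
```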

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.

Figure 1: A X chart for samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can however be invoked directly, as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

Figure 2: A X chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported on Table 1, an X chart for one-at-time data may be obtained specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
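The moving-range estimate of the process standard deviation can be sketched in base R (our own illustration of the standard k = 2 case, using the usual unbiasing constant d2 = 1.128 for samples of size two; this is not the package's code):

```r
# Estimate sigma from individual observations via moving ranges of 2 points:
#   sigma-hat = mean(|x[i] - x[i-1]|) / d2,  with d2 = 1.128 for n = 2
sd.from.moving.range <- function(x, d2 = 1.128) {
  mr <- abs(diff(x))   # moving ranges of consecutive points
  mean(mr) / d2
}

set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)
sd.from.moving.range(x)   # close to the true sigma of 2
```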

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)


> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
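For an xbar chart with limits at ±3 standard errors, the type II error for a given shift can also be computed directly from the textbook formula β = Φ(3 − δ√n) − Φ(−3 − δ√n), where δ is the shift in process standard deviations. The snippet below is our own illustration of that formula; oc.curves does the general computation:

```r
# Probability of NOT detecting a mean shift of delta*sigma on an
# xbar chart with 3-sigma limits and samples of size n
beta <- function(delta, n, nsigmas = 3)
  pnorm(nsigmas - delta * sqrt(n)) - pnorm(-nsigmas - delta * sqrt(n))

beta(1, n = 5)   # about 0.78: a 1-sigma shift is often missed
beta(2, n = 5)   # much smaller: large shifts are detected quickly
```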

Figure 3: Operating characteristic curves for the xbar chart (type II error versus process shift in standard deviations, for several sample sizes n) and for the p chart (type II error versus p).

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
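The cumulative sums behind such a chart follow a simple recursion; the sketch below is our own illustration of the standard tabular cusum, not the package's internal code, with reference value k (half the shift to detect, in standard errors) and decision interval h:

```r
# Tabular cusum for standardized group statistics z:
#   C+[i] = max(0, C+[i-1] + z[i] - k)   (upper sum)
#   C-[i] = max(0, C-[i-1] - z[i] - k)   (lower sum)
# An alarm is signalled when either sum exceeds the decision interval h.
cusum.sums <- function(z, k = 0.5, h = 5) {
  cp <- cm <- numeric(length(z))
  up <- dn <- 0
  for (i in seq_along(z)) {
    up <- max(0, up + z[i] - k); cp[i] <- up
    dn <- max(0, dn - z[i] - k); cm[i] <- dn
  }
  list(pos = cp, neg = cm, alarm = which(cp > h | cm > h))
}

z <- c(rep(0, 5), rep(2, 10))  # standardized values: a 2-sigma shift after sample 5
cusum.sums(z)$alarm            # first alarm at sample 9
```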

Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, also this plot can be used to detect small and permanent variation on the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
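Under the hood, the smoothing is just the recursion z[i] = λ·x[i] + (1 − λ)·z[i−1]; in base R this is a one-liner with a recursive filter (our own illustration, not the package's code):

```r
# EWMA smoothing: z[i] = lambda * x[i] + (1 - lambda) * z[i-1]
ewma.smooth <- function(x, lambda = 0.2, start = mean(x))
  stats::filter(lambda * x, 1 - lambda, method = "recursive", init = start)

x <- c(10, 10, 10, 14, 14, 14)   # a level shift after the third value
round(as.numeric(ewma.smooth(x, lambda = 0.2, start = 10)), 3)
# [1] 10.000 10.000 10.000 10.800 11.440 11.952
```

Note how the smoothed series reacts gradually to the shift, which is what makes EWMA charts sensitive to small, persistent changes.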


Figure 5: EWMA chart.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for a "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125       Target = 74
Center = 74.00118         LSL = 73.95
StdDev = 0.009887547      USL = 74.05

Capability indices:
      Value   2.5%  97.5%
Cp    1.686  1.476  1.895
Cp_l  1.725  1.539  1.912
Cp_u  1.646  1.467  1.825
Cp_k  1.646  1.433  1.859
Cpm   1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is an histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
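The point estimates above can be reproduced from the usual definitions of the capability indices, plugging in the center and standard deviation printed in the output:

```r
# Capability indices from their standard definitions:
#   Cp   = (USL - LSL) / (6 sigma)
#   Cp_l = (mu - LSL) / (3 sigma),  Cp_u = (USL - mu) / (3 sigma)
#   Cp_k = min(Cp_l, Cp_u)
mu    <- 74.00118
sigma <- 0.009887547
LSL   <- 73.95
USL   <- 74.05

Cp   <- (USL - LSL) / (6 * sigma)
Cp_l <- (mu - LSL) / (3 * sigma)
Cp_u <- (USL - mu) / (3 * sigma)
Cp_k <- min(Cp_l, Cp_u)
round(c(Cp = Cp, Cp_l = Cp_l, Cp_u = Cp_u, Cp_k = Cp_k), 3)
#    Cp  Cp_l  Cp_u  Cp_k
# 1.686 1.725 1.646 1.646
```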

Figure 6: Process capability analysis.

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• a X chart

• a R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered using frequencies in non increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,


and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
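The ordering and cumulative percentages underlying the chart are easy to reproduce by hand in base R:

```r
# Reproduce the Pareto ordering and cumulative percentages by hand
defect <- c(80, 27, 66, 94, 33)
names(defect) <- c("price code", "schedule date", "supplier code",
                   "contact num.", "part num.")

sorted <- sort(defect, decreasing = TRUE)       # non-increasing order
cum.pct <- 100 * cumsum(sorted) / sum(sorted)   # cumulative percentage
round(cum.pct, 1)
# contact num.  price code supplier code  part num. schedule date
#         31.3         58.0          80.0       91.0         100.0
```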

Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility to easily extend the package, defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits, and thus it may prove difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be


called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) {
    lcl <- -conf
    ucl <- +conf
  }
  else {
    if (conf > 0 & conf < 1) {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid conf argument")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then, we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²        (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y        (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebraare

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
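As a quick illustration of the first two principles, the following sketch (ours, using simulated data rather than the article's example) shows the equivalences involved:

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)  # simulated 100 x 5 model matrix
y <- rnorm(100)

## Principle 2: crossprod avoids forming explicit transposes
all.equal(crossprod(X), t(X) %*% X)        # TRUE
all.equal(crossprod(X, y), t(X) %*% y)     # TRUE

## Principle 1: solve one system instead of inverting
b1 <- solve(crossprod(X)) %*% crossprod(X, y)  # forms the inverse: avoid
b2 <- solve(crossprod(X), crossprod(X, y))     # solves X'X b = X'y directly
all.equal(b1, b2)                              # TRUE
```

Both forms give the same coefficients; the second does less work and is numerically more stable.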

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sysgc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sysgc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
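The article does not list sysgc.time itself; a minimal version consistent with that description (our sketch, not the author's code) could be written as:

```r
sysgc.time <- function(expr) {
    gc()               # run the garbage collector first, so the timing
    system.time(expr)  # is not distorted by a collection mid-measurement
}
```

Because R evaluates arguments lazily, expr is not evaluated until system.time forces it, so only the expression itself is timed.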

According to the principles above, it should be more effective to compute

> sysgc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sysgc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:

> sysgc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sysgc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or ATLAS (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior, if nothing is provided by a user, is to open the first package in the library of locally installed R packages (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone, it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...

> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30 + time,
+                    status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt,
+              data = veteran),
+      xlab = "Years since randomisation",
+      xscale = 365, ylab = "% surviving",
+      yscale = 100,
+      col = c("forestgreen", "blue"))

> survdiff(Surv(time, status) ~ trt,
+          data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

        N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1  69       64     64.5   0.00388   0.00823
trt=2  68       64     63.5   0.00394   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


[Figure 1 here: survival curves, % surviving (0 to 100) against years since randomisation (0 to 2.5), for the Standard and New treatment groups.]

Figure 1: Survival distributions for two lung cancer treatments.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h₀(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h₀(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)  se(coef)     z       p
edtrt         1.02980     2.800  0.300321  3.43 0.00061
log(bili)     0.95100     2.588  0.097771  9.73 0.00000
log(protime)  2.88544    17.911  1.031908  2.80 0.00520
age           0.03544     1.036  0.008489  4.18 0.00003
platelet     -0.00128     0.999  0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+          ratetable(edtrt = edtrt, bili = bili,
+                    platelet = platelet, age = age,
+                    protime = protime),
+        data = pbc, subset = trt == -9,
+        ratetable = mayomodel, cohort = TRUE),
+       col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

[Figure 2 here: observed and predicted survival curves, survival probability (0.0 to 1.0) against time (0 to 4000 days).]

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure 3 here: smoothed estimate of Beta(t) for edtrt plotted against time (roughly 210 to 3400 days).]

Figure 3: Testing proportional hazards.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
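As an illustrative sketch (ours, not from the article), a parametric counterpart to the Cox fits above might look like:

```r
library(survival)
data(veteran)
## Weibull accelerated-failure-time model for the lung cancer trial data
wfit <- survreg(Surv(time, status) ~ trt + age,
                data = veteran, dist = "weibull")
summary(wfit)
```

The choice of trt and age as predictors here is arbitrary, purely for illustration.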

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004

The R User Conference

by John Fox
McMaster University, Canada (jfox@mcmaster.ca)

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program was the set of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk

Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
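The internal representations described above can be checked directly. A small sketch (the chron line assumes the contributed chron package is installed):

```r
## Date: days since January 1, 1970
unclass(as.Date("1970-01-02"))                 # 1

## chron: days since January 1, 1970, plus fraction of a day
library(chron)
unclass(chron("01/02/1970", "12:00:00"))       # 1.5 (noon on January 2nd)

## POSIXct: seconds since January 1, 1970 GMT
unclass(as.POSIXct("1970-01-02", tz = "GMT"))  # 86400
```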

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
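For example, with a hypothetical vector of Windows-origin Excel serial numbers (our sketch; it assumes the chron package is loaded):

```r
library(chron)
x <- c(38000.00, 38000.25)           # hypothetical Excel serial date/times
as.Date("1899-12-30") + floor(x)     # Date class (date part only)
chron("12/30/1899") + x              # chron date/times, keeping the time part
```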

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
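A minimal sketch contrasting the two situations (the zoo call assumes only the basic zoo(x, order.by) constructor; the data are simulated):

```r
# Regular monthly series: ts with its own internal date scheme
ts(rnorm(24), start = c(2003, 1), frequency = 12)

# Irregular series indexed by Date class dates via zoo
library(zoo)
zoo(rnorm(3),
    as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
```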

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.[1]

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60
orig + x   # chron dates with default origin

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
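For the plotting pitfall in the first bullet, one workaround (the idiom used in the plot row of Table 1) is to reinterpret the datetimes as GMT before plotting:

```r
dp <- seq(Sys.time(), len = 10, by = "day")

# Format in the current time zone, then re-read as GMT, so that the
# plotted positions agree with local wall clock time.
plot(as.POSIXct(format(dp), tz = "GMT"), 1:10)
```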

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]

now
    Date:    Sys.Date()
    chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
    POSIXct: Sys.time()

origin
    Date:    structure(0, class = "Date")
    chron:   chron(0)
    POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
    Date:    structure(floor(x), class = "Date")
    chron:   chron(x)
    POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT

diff in days
    Date:    dt1 - dt2
    chron:   dc1 - dc2
    POSIXct: difftime(dp1, dp2, unit = "day")  [1]

time zone difference
    POSIXct: dp - as.POSIXct(format(dp), tz = "GMT")

compare
    Date:    dt1 > dt2
    chron:   dc1 > dc2
    POSIXct: dp1 > dp2

next day
    Date:    dt + 1
    chron:   dc + 1
    POSIXct: seq(dp0, length = 2, by = "DSTday")[2]  [2]

previous day
    Date:    dt - 1
    chron:   dc - 1
    POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]  [3]

x days since date
    Date:    dt0 + floor(x)
    chron:   dc0 + x
    POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  [4]

sequence
    Date:    dts <- seq(dt0, length = 10, by = "day")
    chron:   dcs <- seq(dc0, length = 10)
    POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
    Date:    plot(dts, rnorm(10))
    chron:   plot(dcs, rnorm(10))
    POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  [5]

every 2nd week
    Date:    seq(dt0, length = 3, by = "2 week")
    chron:   dc0 + seq(0, length = 3, by = 14)
    POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
    Date:    as.Date(format(dt, "%Y-%m-01"))
    chron:   chron(dc) - month.day.year(dc)$day + 1
    POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:    as.numeric(format(dt, "%m"))
    chron:   months(dc)
    POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
    Date:    as.numeric(format(dt, "%w"))
    chron:   as.numeric(dates(dc) - 3) %% 7
    POSIXct: as.numeric(format(dp, "%w"))

day of year
    Date:    as.numeric(format(dt, "%j"))
    chron:   as.numeric(format(as.Date(dc), "%j"))
    POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
    Date:    tapply(x, format(dt, "%Y-%m"), mean)
    chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:    tapply(x, format(dt, "%m"), mean)
    chron:   tapply(x, months(dc), mean)
    POSIXct: tapply(x, format(dp, "%m"), mean)

Output
  yyyy-mm-dd
    Date:    format(dt)
    chron:   format(as.Date(dates(dc)))
    POSIXct: format(dp, "%Y-%m-%d")
  mm/dd/yy
    Date:    format(dt, "%m/%d/%y")
    chron:   format(dates(dc))
    POSIXct: format(dp, "%m/%d/%y")
  Sat Jul 1, 1970
    Date:    format(dt, "%a %b %d, %Y")
    chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
    POSIXct: format(dp, "%a %b %d, %Y")

Input
  "1970-10-15"
    Date:    as.Date(z)
    chron:   chron(z, format = "y-m-d")
    POSIXct: as.POSIXct(z)
  "10/15/1970"
    Date:    as.Date(z, "%m/%d/%Y")
    chron:   chron(z)
    POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date
    chron:   as.Date(dates(dc))
    POSIXct: as.Date(format(dp))
  to chron
    Date:    chron(unclass(dt))
    POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
  to POSIXct
    Date:    as.POSIXct(format(dt))
    chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date
    chron:   as.Date(dates(dc))
    POSIXct: as.Date(dp)
  to chron
    Date:    chron(unclass(dt))
    POSIXct: as.chron(dp)
  to POSIXct
    Date:    as.POSIXct(format(dt), tz = "GMT")
    chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison Table

Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T == TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
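One possible solution to that exercise (a sketch of mine, not from the article): integrate the empirical curve by the trapezoidal rule, using the sens and mspec components stored by the constructor above. The curve is extended to the (0, 0) corner before integrating.

```r
# Trapezoidal-rule area under an "ROC" object from the constructor above
AUC <- function(x) {
  sens  <- c(0, x$sens)   # extend the curve to the (0,0) corner
  mspec <- c(0, x$mspec)
  n <- length(sens)
  sum(diff(mspec) * (sens[-1] + sens[-n]) / 2)
}
```

For a test variable that separates the two groups perfectly, this returns 1.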

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function:

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
         representation(sens = "numeric", mspec = "numeric",
                        test = "numeric", call = "call"),
         validity = function(object) {
           length(object@sens) == length(object@mspec) &&
             length(object@sens) == length(object@test)
         })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
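Those extra conditions could be folded in as well; a hypothetical stricter validity function (mine, not from the article) might read:

```r
# Stricter validity check: matching lengths, monotone curves, rates in [0,1]
validROC <- function(object) {
  length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test) &&
    !is.unsorted(object@sens) && !is.unsorted(object@mspec) &&
    all(object@sens >= 0, object@sens <= 1,
        object@mspec >= 0, object@mspec <= 1)
}
```

It would be supplied as the validity argument to setClass() in place of the simpler check.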

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new() function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
          function(object) {
            cat("ROC curve: ")
            print(object@call)
          })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
          function(x, y, type = "b", null.line = TRUE,
                   xlab = "1-Specificity",
                   ylab = "Sensitivity",
                   main = NULL, ...) {
            par(pty = "s")
            plot(x@mspec, x@sens, type = type,
                 xlab = xlab, ylab = ylab, ...)
            if (null.line)
              abline(0, 1, lty = 3)
            if (is.null(main))
              main <- x@call
            title(main = main)
          })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
          function(x, ...)
            lines(x@mspec, x@sens, ...))

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm except that new is used to create the ROC object:

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
          function(T) {
            if (!(T$family$family %in%
                  c("binomial", "quasibinomial")))
              stop("ROC curves for binomial glms only")
            test <- fitted(T)
            disease <- (test + resid(T, type = "response"))
            disease <- disease * weights(T)
            if (max(abs(disease %% 1)) > 0.01)
              warning("Y values suspiciously far from integers")
            TT <- rev(sort(unique(test)))
            DD <- table(-test, disease)
            sens <- cumsum(DD[, 2]) / sum(DD[, 2])
            mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
            new("ROC", sens = sens, mspec = mspec,
                test = TT, call = sys.call())
          })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))
setMethod("identify", "ROC",
          function(x, labels = NULL, ..., digits = 1) {
            if (is.null(labels))
              labels <- round(x@test, digits)
            identify(x@mspec, x@sens, labels = labels, ...)
          })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits = FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl = TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class = ) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• POSIXct objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv = 0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out = n) now gives a vector of length n and not length 0, for S compatibility.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split = argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


bull Model fits now add a dataClasses attributeto the terms which can be used to check thatthe variables supplied for prediction are of thesame type as those used for fitting (It is cur-rently used by predict() methods for classeslm mlm glm and ppr as well asmethods in packages MASS rpart and tree)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
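For example:

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
# Coerces the environment to a named list; component order is
# not guaranteed, since environments are unordered:
as.list(e)
```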

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
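A minimal example (table chosen arbitrarily): by default addmargins() appends sums over every dimension.

```r
tab <- table(gear = mtcars$gear, cyl = mtcars$cyl)
# Adds a "Sum" row and column; the corner cell is the grand
# total, i.e. the number of rows of mtcars:
addmargins(tab)
```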

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The "diamond" frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added 'just' argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w = .5, h = .5, name = "A"))
    grid.rect()
    pushViewport(viewport(w = .5, h = .5, name = "B"))
    grid.rect(gp = gpar(col = "grey"))
    upViewport(2)
    grid.rect(vp = "A", gp = gpar(fill = "red"))
    grid.rect(vp = vpPath("A", "B"), gp = gpar(fill = "blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

  * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

  * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

  * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

  In addition, the following features are now possible with grobs:

  * grobs now save()/load() like any normal R object.

  * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

  * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport that is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

  * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

  * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport 'clip' argument are now:

    "on"      = clip to the viewport (was TRUE)
    "inherit" = clip to whatever parent says (was FALSE)
    "off"     = turn off clipping

  Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().
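The replacement is a drop-in for the usual task of formatting a coefficient matrix; a sketch (model and data chosen arbitrarily):

```r
fit <- lm(dist ~ speed, data = cars)
cm <- coef(summary(fit))  # estimates, std. errors, t and p values
# printCoefmat() gives the familiar significance-star layout:
printCoefmat(cm)
```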

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated now that eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv = 2) and psigamma(, deriv = 3).
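The equivalence is easy to check: psigamma(x, deriv = n) is the n-th derivative of digamma(x), so deriv = 1 reproduces trigamma(), deriv = 2 what tetragamma() returned, and deriv = 3 what pentagamma() returned.

```r
x <- c(0.5, 1, 2, 10)
# deriv = 1 agrees with trigamma():
all.equal(psigamma(x, deriv = 1), trigamma(x))
# Replacements for the deprecated functions:
psigamma(x, deriv = 2)  # was tetragamma(x)
psigamma(x, deriv = 3)  # was pentagamma(x)
```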

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object-oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-PLUS times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state and steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


[Figure 3 (not reproducible in this text extract) shows the biplot: the state labels overlaid with arrows for Murder, Assault, UrbanPop and Rape; scale d = 1.]

Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed on the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner. Here the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis but not used to draw the graph.

Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:

> s.label(pca1$li)


Figure 4: Individuals factor map in a normed PCA.

Distance matrices

A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre (1986)), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971), cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
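The Euclidean property that quasieuclid, lingoes and cailliez operate on can be illustrated in plain base R (this is an illustrative sketch of Gower's double-centering criterion, not ade4 code): a distance matrix is Euclidean when the doubly centered matrix -0.5 * J D^2 J has no negative eigenvalues.

```r
## Sketch of Gower's criterion: d is Euclidean iff -0.5 * J %*% d^2 %*% J
## (J the centering operator) is positive semi-definite.
is_euclidean <- function(d) {
  d <- as.matrix(d)
  n <- nrow(d)
  J <- diag(n) - matrix(1/n, n, n)     # centering operator
  B <- -0.5 * J %*% (d^2) %*% J        # doubly centered squared distances
  all(eigen(B, symmetric = TRUE, only.values = TRUE)$values > -1e-8)
}

## Distances between points of the plane are Euclidean by construction
xy <- cbind(c(0, 3, 0), c(0, 0, 4))
is_euclidean(dist(xy))   # TRUE
```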

The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows one to compare the geographical, genetic and anthropometric distances.

> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow=c(2,2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi="bottom")
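For readers who want to see the principle of a PCO without ade4 installed, base R's cmdscale performs a comparable principal coordinates analysis; a small sketch on the built-in eurodist road-distance matrix:

```r
## Principal coordinates analysis with base R's cmdscale(), applied to
## the built-in matrix of road distances between 21 European cities
pco <- cmdscale(eurodist, k = 2, eig = TRUE)
dim(pco$points)   # 21 cities, 2 Euclidean coordinates
head(pco$eig)     # leading eigenvalues of the decomposition
```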


Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.

R News ISSN 1609-3631


Taking into account groups of individuals

In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):

> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)

The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
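The core object of a between-class analysis, the table G of per-group means of X, can be sketched in a few lines of base R (toy data, not the meaudret table):

```r
## Table of group means G: for each column of X, average the rows
## belonging to each level of the grouping factor f
X <- matrix(1:12, nrow = 4)          # 4 sites x 3 species (toy data)
f <- factor(c("a", "a", "b", "b"))   # two classes of sites
G <- apply(X, 2, function(col) tapply(col, f, mean))
G
```

The between function then analyses this reduced table with the group weights Dw.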


Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).

Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)^-, Dw), where (XtDX)^- is a generalized inverse. This gives for example a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^-1 Yt DX, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).

Permutation tests

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest) programming language. The R version allows one to see how computations are performed, and to write easily other tests, while the C version is needed for performance reasons.

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main="Between class inertia")


Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
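The logic of such Monte-Carlo tests can be sketched in base R: permute the grouping, recompute the statistic, and compare the observed value with the simulated distribution. A toy illustration (not ade4 code, and using a simple mean difference instead of the between-class inertia):

```r
## Toy permutation test: two groups with a real mean shift
set.seed(1)
x <- c(rnorm(10), rnorm(10, mean = 2))
g <- rep(c("A", "B"), each = 10)
obs <- abs(diff(tapply(x, g, mean)))    # observed statistic
sim <- replicate(999, {
  gs <- sample(g)                       # permute the class labels
  abs(diff(tapply(x, gs, mean)))
})
p <- (1 + sum(sim >= obs)) / (999 + 1)  # simulated p-value
p   # small, as the group effect is real
```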

Conclusion

We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non symmetric correspondence analysis (dudi.nsc), decentered correspondence analysis (dudi.dec).

We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).

The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy, Lavit et al. (1994) (statis and pta functions) or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints(multispati function) and phylogenetic constraints(phylog function) are under development

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée : application aux peuplements ichtyologiques du Haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Developments in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.

Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Nonsymmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd English edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows the user to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important and R was a natural candidate.

The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of this last environment will find some similarities but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200   3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40  5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size:  5
Number of groups:  25
Center of group statistics:  74.00118
Standard deviation:  0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


type    Control charts for variables

xbar    X chart. Sample means are plotted to control the mean level of a continuous process variable.

xbar.one    X chart. Sample values from a one-at-time data process to control the mean level of a continuous process variable.

R    R chart. Sample ranges are plotted to control the variability of a continuous process variable.

S    S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type    Control charts for attributes

p    p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

np    np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.

c    c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.

u    u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or specifying the confidence level (so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.
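This equivalence can be checked directly in base R:

```r
## Two-tail probability outside +/- 3 standard deviations of a
## standard normal distribution
2 * pnorm(-3)
# [1] 0.002699796

## and conversely, the number of sigmas for a 0.9973 confidence level
qnorm(1 - 0.0027/2)
# [1] 3.000
```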

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
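For the xbar chart above, the default limits can be reproduced by hand from the values printed by summary, using the standard formula center ± nsigmas · σ/√n:

```r
## Reproduce the printed control limits of the xbar chart
center  <- 74.00118       # center of group statistics
std.dev <- 0.009887547    # within-group standard deviation
n <- 5                    # group sample size
center + c(-3, 3) * std.dev / sqrt(n)
# [1] 73.98791 74.01444
```

These agree with the LCL and UCL reported in the summary output.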


Figure 1: An X chart for samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can however be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.


Figure 2: An X chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported on Table 1, an X chart for one-at-time data may be obtained specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial toselect the first 30 samples as calibration data

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probabilities values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows to interactively identify values on the plot.
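For an xbar chart, the type II error underlying these curves has a well-known closed form (this is the textbook formula, not qcc's internal code): for a mean shift of δ process standard deviations and limits at ±3 standard errors, β = Φ(3 − δ√n) − Φ(−3 − δ√n).

```r
## Textbook type II error for an xbar chart with limits at
## +/- nsigmas standard errors, shift delta in process sd units
beta <- function(delta, n, nsigmas = 3)
  pnorm(nsigmas - delta * sqrt(n)) - pnorm(-nsigmas - delta * sqrt(n))

beta(1, n = 5)   # probability of missing a 1-sigma shift, about 0.78
```

This matches the n = 5 curve in the left panel of Figure 3 at a shift of one standard deviation.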


Figure 3: Operating characteristics curves.

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
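The cumulative sums behind such a chart follow a simple recursion; a base-R sketch of the standard textbook form (not qcc's internal implementation), with the usual allowance k equal to half the shift to be detected, in standard error units:

```r
## Upper/lower cusum recursion on standardized statistics z:
##   C+_i = max(0, z_i - k + C+_{i-1}),  C-_i = max(0, -z_i - k + C-_{i-1})
cusum_stat <- function(z, k = 0.5) {
  cp <- cn <- numeric(length(z))
  for (i in seq_along(z)) {
    prev_p <- if (i == 1) 0 else cp[i - 1]
    prev_n <- if (i == 1) 0 else cn[i - 1]
    cp[i] <- max(0, z[i] - k + prev_p)   # accumulates deviations above target
    cn[i] <- max(0, -z[i] - k + prev_n)  # accumulates deviations below target
  }
  list(pos = cp, neg = cn)
}

cs <- cusum_stat(c(0, 0, 1, 1, 1))
cs$pos
# [1] 0.0 0.0 0.5 1.0 1.5
```

A point is signalled when either sum crosses the decision interval (5 by default in qcc).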


Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, also this plot can be used to detect small and permanent variation on the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5
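The smoothing underlying the chart is the standard EWMA recursion z_i = λ x_i + (1 − λ) z_{i−1}; a base-R sketch of this textbook form (not qcc's internal code):

```r
## EWMA recursion: each smoothed value mixes the new observation with
## the previous smoothed value, weighted by lambda
ewma_stat <- function(x, lambda = 0.2, z0 = mean(x)) {
  z <- numeric(length(x))
  prev <- z0
  for (i in seq_along(x)) {
    z[i] <- lambda * x[i] + (1 - lambda) * prev
    prev <- z[i]
  }
  z
}

ewma_stat(c(10, 10, 14), lambda = 0.5, z0 = 10)
# [1] 10 10 12
```

Small λ gives long memory (good for detecting small shifts); λ = 1 reproduces the raw data.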



Figure 5: EWMA chart.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for a "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125    Target = 74
Center = 74.00118      LSL = 73.95
StdDev = 0.009887547   USL = 74.05

Capability indices:

      Value   2.5%  97.5%
Cp    1.686  1.476  1.895
Cp_l  1.725  1.539  1.912
Cp_u  1.646  1.467  1.825
Cp_k  1.646  1.433  1.859
Cpm   1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is an histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
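The main indices printed above can be reproduced by hand from their usual definitions, using the center and standard deviation shown in the output:

```r
## Reproduce the capability indices from their standard definitions
center <- 74.00118
sigma  <- 0.009887547
LSL <- 73.95; USL <- 74.05

Cp  <- (USL - LSL) / (6 * sigma)       # potential capability
Cpl <- (center - LSL) / (3 * sigma)    # capability w.r.t. lower limit
Cpu <- (USL - center) / (3 * sigma)    # capability w.r.t. upper limit
Cpk <- min(Cpl, Cpu)                   # actual capability

round(c(Cp = Cp, Cp_l = Cpl, Cp_u = Cpu, Cp_k = Cpk), 3)
#    Cp  Cp_l  Cp_u  Cp_k
# 1.686 1.725 1.646 1.646
```

These agree with the values printed by process.capability above.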


Figure 6: Process capability analysis.

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95, 74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.
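The ordering and cumulative percentages behind a Pareto chart are easy to sketch in base R (using the same illustrative frequencies as the pareto.chart example that follows):

```r
## Order categories by decreasing frequency and compute the
## cumulative percentage traced by the Pareto line
defect <- c("price code" = 80, "schedule date" = 27,
            "supplier code" = 66, "contact num" = 94,
            "part num" = 33)
sorted <- sort(defect, decreasing = TRUE)
cum.pct <- round(100 * cumsum(sorted) / sum(sorted), 1)
cum.pct
# contact num    price code supplier code      part num schedule date
#        31.3          58.0          80.0          91.0         100.0
```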

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num",
+     "part num")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).


Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")


Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility to easily extend the package, defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits, and thus it may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

> # set unequal sample sizes
> n <- c(rep(50, 5), rep(100, 5), rep(25, 5))
> # generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow = c(1, 2))
> # plot the control chart with variable limits
> qcc(x, type = "p", size = n)
> # plot the standardized control chart
> qcc(x, type = "p.std", size = n)

[Figure 9 here shows the two control charts for x: the "p Chart" (left: Number of groups = 15, Center = 0.1931429, StdDev = 0.3947641, variable LCL and UCL) and the "p.std Chart" (right: Center = 0, StdDev = 1, LCL = -3, UCL = 3); in both, Number beyond limits = 0 and Number violating runs = 0.]

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R
Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ||y − Xβ||²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
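The three principles above can be sketched on a small simulated problem (the matrix and response here are made up for illustration; the timings in the next section use a much larger example):

```r
## A small made-up least squares problem
set.seed(1)
X <- matrix(rnorm(1000 * 50), nrow = 1000, ncol = 50)
y <- rnorm(1000)

## Naive translation of (2): inverts X'X and forms t(X) %*% X explicitly
b.naive <- solve(t(X) %*% X) %*% t(X) %*% y

## Principles 1 and 2: solve a linear system built with crossprod
b.cpod <- solve(crossprod(X), crossprod(X, y))

## Principle 3: exploit the symmetry of X'X via its Cholesky factor
R <- chol(crossprod(X))
b.chol <- backsolve(R, forwardsolve(R, crossprod(X, y),
                                    upper = TRUE, trans = TRUE))

all.equal(as.vector(b.naive), as.vector(b.cpod))
all.equal(as.vector(b.naive), as.vector(b.chol))
```

All three give the same coefficients (up to numerical error); the differences show up in speed and numerical stability as the problem grows.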

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute:

> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:

> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times:

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS) such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30 + time,
+                    status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]

The first "Surv" object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the G-rho family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation",
+      xscale = 365, ylab = "% surviving", yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

        N Observed Expected (O-E)^2/E
trt=1  69       64     64.5   0.00388
trt=2  68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


[Figure 1 here shows the two survival curves, with x-axis "Years since randomisation" (0.0 to 2.5), y-axis "% surviving" (0 to 100), and a legend distinguishing "Standard" and "New".]

Figure 1: Survival distributions for two lung cancer treatments.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+     log(bili) + log(protime) + age + platelet,
+     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp( ~ edtrt +
+     ratetable(edtrt = edtrt, bili = bili,
+         platelet = platelet, age = age,
+         protime = protime),
+     data = pbc, subset = trt == -9,
+     ratetable = mayomodel, cohort = TRUE),
+     col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

[Figure 2 here shows the survival curves, with x-axis running from 0 to 4000 (days) and y-axis from 0.0 to 1.0.]

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure 3 here plots "Beta(t) for edtrt" against "Time" (roughly 210 to 3400), with the smoothed estimate declining over time.]

Figure 3: Testing proportional hazards.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference - useR! 2004 - which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations - from bioinformatics to user interfaces, and finance to fights among lizards - reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference - the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
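As a small sketch of the three systems, here is the same instant represented in each class, from simplest to most complex (this assumes the contributed chron package has been installed; the date chosen is arbitrary):

```r
## The same instant in the three date/time systems
d  <- as.Date("2004-06-01")              # Date: days since 1970-01-01
library(chron)                            # contributed package from CRAN
dc <- chron("06/01/2004", "12:00:00")     # chron: days plus fraction of a day
ct <- as.POSIXct("2004-06-01 12:00:00",
                 tz = "GMT")              # POSIXct: seconds since 1970, GMT

unclass(d)    # 12570 (days since the origin)
unclass(ct)   # 1086091200 (seconds since the origin)
```

Inspecting the internals with unclass(), as mentioned above, makes the different origins and units concrete.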

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904 so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
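As a quick sketch of the Windows-origin rule above (the serial numbers here are made-up examples):

```r
# Excel (Windows) serial day numbers -> Date class,
# using the 1899-12-30 origin described in the text
x <- c(1, 36526)                   # hypothetical Excel values
as.Date("1899-12-30") + floor(x)   # "1899-12-31" "2000-01-01"
```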

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
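The same arithmetic can be sketched in base R without Hmisc (the input values below are hypothetical, and the origin= argument to as.POSIXct is assumed to be available):

```r
# SAS dates: days since 1960-01-01
as.Date("1960-01-01") + 366        # "1961-01-01" (1960 is a leap year)

# SPSS datetimes: seconds since 1582-10-14 00:00:00 GMT
as.POSIXct(12e9, origin = "1582-10-14", tz = "GMT")
```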

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
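For instance (a sketch with made-up data; the zoo package must be installed for the second part):

```r
# Regular monthly series: ts labels observations by frequency
ts(rnorm(12), start = c(2004, 1), frequency = 12)

# Irregular series: zoo indexed by Date class dates
library(zoo)
zoo(rnorm(3), as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
```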

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
  as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = "year.expand")
# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
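A small sketch of the character-first recipe (the time below is an arbitrary example):

```r
p <- as.POSIXct("2004-06-15 23:30:00")  # in the current time zone
as.Date(format(p))   # the date as seen on the local clock: "2004-06-15"
as.Date(p)           # converts via GMT and may give the next day
```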

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")  [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]  [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]  [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct: format(dp, "%a %b %d, %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

to chron
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]

to POSIXct
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

to chron
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

to POSIXct
  Date:    as.POSIXct(format(dt), tz = "GMT")
  chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison Table


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
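As a concrete (made-up) numeric illustration, the two rates at a single cutpoint c = 3 can be computed directly from the definition:

```r
T <- c(1, 2, 3, 4, 5, 6)    # hypothetical test results
D <- c(0, 0, 1, 0, 1, 1)    # disease indicator
sum(D[T > 3]) / sum(D)              # sensitivity Pr(T>c|D=1) = 2/3
sum((1 - D)[T > 3]) / sum(1 - D)    # 1-specificity Pr(T>c|D=0) = 1/3
```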

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1 - D)[T <= c])/sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
    call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
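A toy sketch of this dispatch convention (the class and generic names here are invented for illustration):

```r
quack <- function(x, ...) UseMethod("quack")
quack.duck <- function(x, ...) "quack!"

d <- structure(list(), class = "duck")
quack(d)    # dispatches to quack.duck, returning "quack!"
```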

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
  xlab = "1-Specificity", ylab = "Sensitivity",
  main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
    xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
  digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
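One possible solution sketch (not part of the original article) applies the trapezoidal rule to the points of the curve, assuming sens and mspec are nondecreasing and reach 1, as produced by the constructor above:

```r
# Area under the ROC curve by the trapezoidal rule
AUC <- function(x) {
  fpr <- c(0, x$mspec)   # prepend the (0, 0) corner
  tpr <- c(0, x$sens)
  n <- length(fpr)
  sum(diff(fpr) * (tpr[-1] + tpr[-n]) / 2)
}
```

For example, AUC(list(sens = c(1, 1), mspec = c(0, 1))) returns 1 for a perfect test.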

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
    test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
      c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
    test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
    test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
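To see the validity check in action, the class definition can be exercised directly (a sketch; the class is re-registered under the hypothetical name "ROC2" so the snippet stands alone):

```r
library(methods)
setClass("ROC2",
  representation(sens = "numeric", mspec = "numeric",
    test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

# Mismatched slot lengths are rejected by new()
bad <- try(new("ROC2", sens = c(0.5, 1), mspec = 1,
  test = c(2, 1), call = quote(ROC(T, D))), silent = TRUE)
inherits(bad, "try-error")   # TRUE
```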

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
    test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
    xlab = "1-Specificity",
    ylab = "Sensitivity",
    main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
      xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))
setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
    digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
      labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

  If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.

R News ISSN 1609-3631

Vol. 4/1, June 2004

• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
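For instance (a sketch, not from the NEWS file):

```r
# Coerce an environment to a named list; each bound symbol becomes
# a list component.
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
l <- as.list(e)
sort(names(l))  # "a" "b"
```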

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
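A quick sketch on a toy two-way table (not from the NEWS file):

```r
# addmargins() appends row and column totals, labelled "Sum" by default.
tab <- as.table(matrix(1:4, nrow = 2,
                       dimnames = list(g = c("g1", "g2"),
                                       h = c("h1", "h2"))))
m <- addmargins(tab)
m["Sum", "Sum"]  # grand total
```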

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like "vp1::vp2", but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  – Added a just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A",
                gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"),
                gp=gpar(fill="blue"))


  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

      "on" = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off" = turn off clipping

    Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; eigen() now uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST requests. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and a backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld D. (2001), "A Simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

  • Editorial
  • The Decision To Use R
    • Preface
    • Some Background on the Company
    • Evaluating The Tools of the Trade - A Value Based Equation
    • An Introduction To R
    • A Continued Evolution
    • Giving Back
    • Thanks
      • The ade4 package - I One-table methods
        • Introduction
        • The duality diagram class
        • Distance matrices
        • Taking into account groups of individuals
        • Permutation tests
        • Conclusion
          • qcc An R package for quality control charting and statistical process control
            • Introduction
            • Creating a qcc object
            • Shewhart control charts
            • Operating characteristic function
            • Cusum charts
            • EWMA charts
            • Process capability analysis
            • Pareto charts and cause-and-effect diagrams
            • Tuning and extending the package
            • Summary
            • References
              • Least Squares Calculations in R
              • Tools for interactively exploring R packages
                • References
                  • The survival package
                    • Overview
                    • Specifying survival data
                    • One and two sample summaries
                    • Proportional hazards models
                    • Additional features
                      • useR 2004
                      • R Help Desk
                        • Introduction
                        • Other Applications
                        • Time Series
                        • Input
                        • Avoiding Errors
                        • Comparison Table
                          • Programmers Niche
                            • ROC curves
                            • Classes and methods
                              • Generalising the class
                                • S4 classes and method
                                • Discussion
                                  • Changes in R
                                    • New features in 191
                                    • User-visible changes in 190
                                    • New features in 190
                                      • Changes on CRAN
                                        • New contributed packages
                                        • Other changes
                                          • R Foundation News
                                            • Donations
                                            • New supporting institutions
                                            • New supporting members
R News, Vol. 4/1, June 2004

Taking into account groups of individuals

In sites × species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):

> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)

The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore this is a between-sites analysis, whose aim is to discriminate the sites given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.

Figure 6: BCA plot. This is a composed plot made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).

Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)⁻, Dw), where (XtDX)⁻ is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X − Y Dw⁻¹ YtDX, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).

Permutation tests

Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest) programming language. The R

R News ISSN 1609-3631


version allows one to see how computations are performed and to easily write other tests, while the C version is needed for performance reasons.

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")
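The simulated p-value follows the usual Monte-Carlo convention of counting the observed statistic among the simulated ones. A minimal sketch of that computation, with placeholder simulated values (the names obs and sim are illustrative, not part of the ade4 API):

```r
# Monte-Carlo p-value: count simulated values at least as extreme
# as the observed one, the observation itself included
obs <- 0.4292                    # observed between-class inertia ratio
sim <- runif(999, 0.1, 0.35)     # placeholder for the 999 simulated values
p.value <- (1 + sum(sim >= obs)) / (1 + length(sim))
p.value                          # 0.001 when no simulated value reaches obs
```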

Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.

Conclusion

We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).

We are preparing a second paper dealing with two-table coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).

The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al. (1994); statis and pta functions) or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée : application aux peuplements ichtyologiques du haut-rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Developments in Numerical Ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.


Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Nonsymmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical Ecology. Elsevier Science BV, Amsterdam, 2nd English edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo Methods in Biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows one to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, although written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. It can be created simply by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample or "rationale group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


Control charts for variables:

  type = "xbar"      X chart. Sample means are plotted to control the mean level of a continuous process variable.

  type = "xbar.one"  X chart. Sample values from a one-at-a-time data process are plotted to control the mean level of a continuous process variable.

  type = "R"         R chart. Sample ranges are plotted to control the variability of a continuous process variable.

  type = "S"         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes:

  type = "p"         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

  type = "np"        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.

  type = "c"         c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.

  type = "u"         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control", and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or by specifying the confidence level (so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
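As a check, the default limits of the X chart above can be reproduced by hand from the reported center and within-group standard deviation; this sketch uses the standard formula and is not a call into qcc internals:

```r
center <- 74.00118       # center of group statistics
std.dev <- 0.009887547   # within-group standard deviation
n <- 5                   # group sample size
nsigmas <- 3
c(LCL = center - nsigmas * std.dev / sqrt(n),
  UCL = center + nsigmas * std.dev / sqrt(n))
# close to the reported LCL = 73.98791 and UCL = 74.01444

# equivalence of the two practices: a two-tails probability
# of 0.0027 corresponds to limits at about 3 standard errors
qnorm(1 - 0.0027/2)   # about 3
```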

Figure 1: An X chart for the samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line) and the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

Figure 2: An X chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-a-time data may be obtained specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
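In its usual formulation, the moving-range estimator divides the mean of the absolute differences of successive points by the constant d2 ≈ 1.128, the value for samples of size 2. A small sketch of that idea, with illustrative data (qcc's internal computation may differ in details):

```r
x <- c(74.030, 74.002, 74.019, 73.992, 74.008)  # one-at-a-time values
mr <- abs(diff(x))            # moving ranges of k = 2 points
sd.hat <- mean(mr) / 1.128    # d2 constant for samples of size 2
sd.hat
```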

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.

Figure 3: Operating characteristics curves.

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation in the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
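In terms of formulas, a tabular cusum in this standardized form accumulates deviations of the standardized group means z_t with a reference value k equal to half the shift to be detected, and signals when either sum exceeds the decision interval h. The following sketch is illustrative only, not qcc's implementation:

```r
# tabular cusum recursions:
#   C+_t = max(0, C+_{t-1} + z_t - k),  C-_t = max(0, C-_{t-1} - z_t - k)
cusum.sketch <- function(z, k = 0.5, h = 5) {
  pos <- neg <- numeric(length(z))
  for (t in seq_along(z)) {
    prev.pos <- if (t == 1) 0 else pos[t - 1]
    prev.neg <- if (t == 1) 0 else neg[t - 1]
    pos[t] <- max(0, prev.pos + z[t] - k)   # deviations above target
    neg[t] <- max(0, prev.neg - z[t] - k)   # deviations below target
  }
  list(pos = pos, neg = neg, out = which(pos > h | neg > h))
}
```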

Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot too can be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
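The underlying statistic is the recursion z_t = λ·x̄_t + (1 − λ)·z_{t−1}, started at the target value. A minimal sketch of this smoothing step (illustrative; the function name ewma.sketch is ours, and qcc's own computation of the widening control limits is not reproduced here):

```r
# EWMA smoothing of a series of group means
ewma.sketch <- function(xbar, lambda = 0.2, target = mean(xbar)) {
  z <- numeric(length(xbar))
  prev <- target                          # z_0 is the target value
  for (t in seq_along(xbar)) {
    z[t] <- lambda * xbar[t] + (1 - lambda) * prev
    prev <- z[t]
  }
  z
}
```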


Figure 5: EWMA chart.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' of type "xbar" or "xbar.one", and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125      Target = 74
Center  = 74.00118       LSL = 73.95
StdDev  = 0.009887547    USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
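The indices follow the standard definitions Cp = (USL − LSL)/(6σ̂), Cp_l = (μ̂ − LSL)/(3σ̂), Cp_u = (USL − μ̂)/(3σ̂), and Cp_k = min(Cp_l, Cp_u), so they can be checked by hand from the values printed above (the confidence intervals require further distributional results not sketched here):

```r
mu <- 74.00118; sigma <- 0.009887547
LSL <- 73.95;   USL <- 74.05
Cp   <- (USL - LSL) / (6 * sigma)   # reported as 1.686
Cp_l <- (mu - LSL) / (3 * sigma)    # reported as 1.725
Cp_u <- (USL - mu) / (3 * sigma)    # reported as 1.646
Cp_k <- min(Cp_l, Cp_u)             # reported as 1.646
round(c(Cp, Cp_l, Cp_u, Cp_k), 3)
```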

Figure 6: Process capability analysis.

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q–Q plot

• a capability plot

As an example, we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95, 74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered by frequency, in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,


and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
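The ordering and cumulative percentages underlying the chart are straightforward to reproduce by hand; this sketch mimics what the chart displays without using pareto.chart internals:

```r
defect <- c(80, 27, 66, 94, 33)
names(defect) <- c("price code", "schedule date",
                   "supplier code", "contact num.", "part num.")
sorted <- sort(defect, decreasing = TRUE)     # non-increasing frequencies
cum.pct <- 100 * cumsum(sorted) / sum(sorted) # cumulative percentages
round(cum.pct, 1)
# contact num.  price code  supplier code  part num.  schedule date
#         31.3        58.0           80.0       91.0          100.0
```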

Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise it sets the required option. From a user's point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be


called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; the corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
[email protected]

Least Squares Calculations in R
Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra – how calculations with matrices can be carried out accurately and efficiently – is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

R News ISSN 1609-3631

Vol. 4/1, June 2004, page 18

In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 \qquad (1)$$

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

$$\hat{\beta} = \left(X'X\right)^{-1} X'y \qquad (2)$$

when X has full column rank (i.e., the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
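As a minimal sketch of these three rules in action (on a small simulated matrix, not the article's data – the dimensions and seed here are illustrative assumptions), one might compare the formulations directly:

```r
## Illustrative only: simulated data, not the mm/y objects used below
set.seed(1)
X <- matrix(rnorm(1000 * 10), nrow = 1000)
y <- rnorm(1000)

## naive literal translation of (2)
b1 <- solve(t(X) %*% X) %*% t(X) %*% y
## rules 1 and 2: solve(A, b) with crossprod()
b2 <- solve(crossprod(X), crossprod(X, y))
## rule 3: exploit the symmetric positive-definite form via Cholesky
R <- chol(crossprod(X))
b3 <- backsolve(R, forwardsolve(R, crossprod(X, y),
                                upper = TRUE, trans = TRUE))

all.equal(as.vector(b1), as.vector(b2))
all.equal(as.vector(b2), as.vector(b3))
```

All three agree up to numerical error; on a matrix this small the timing differences are negligible, which is why a large example is needed to see them.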

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sysgc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sysgc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sysgc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sysgc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:

> sysgc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sysgc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice, once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices – yet another reason to use crossprod.
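A quick check (a sketch on small simulated matrices, not from the article) confirms that crossprod gives the same result without materializing the transpose:

```r
## crossprod(A, B) computes A'B without explicitly forming t(A)
set.seed(2)
A <- matrix(rnorm(200 * 30), nrow = 200)
B <- matrix(rnorm(200 * 20), nrow = 200)
stopifnot(all.equal(t(A) %*% B, crossprod(A, B)))
```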

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D. M. Bates
U. of Wisconsin-Madison
[email protected]

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of tcltk (and be configured for it). Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior, if nothing is provided by a user, is to open the first package in the library of locally installed R packages (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore; the name and contents of the package being explored are updated accordingly.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of that subdirectory and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and from the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation",
+      xscale = 365, ylab = "% surviving", yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928



Figure 1: Survival distributions for two lung cancer treatments.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

$$\log h(t; z) = \log h_0(t) + \beta' z$$

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+        ratetable(edtrt = edtrt, bili = bili,
+                  platelet = platelet, age = age,
+                  protime = protime),
+        data = pbc, subset = trt == -9,
+        ratetable = mayomodel, cohort = TRUE),
+      col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.


Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])


Figure 3: Testing proportional hazards.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
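For instance, a hypothetical sketch (not from the article; the covariate choice here is an illustrative assumption) of a Weibull accelerated failure time model for the veteran data might look like:

```r
## Hedged sketch: a parametric (Weibull) survival model for the
## veteran data, complementing the Cox fits discussed above
library(survival)
data(veteran)
wfit <- survreg(Surv(time, status) ~ trt + age,
                data = veteran, dist = "weibull")
summary(wfit)
```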

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada
[email protected]

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt, and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
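As a quick hedged sketch (not from the article; it assumes the chron package is installed), the internal representations of the three class systems can be compared directly with unclass:

```r
## Comparing internal representations of the three class systems
library(chron)
d  <- as.Date("1970-01-02")                          # days since 1970-01-01
ch <- chron("01/02/70", "12:00:00")                  # days + fraction of a day
ct <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")  # seconds since the origin

unclass(d)    # 1
unclass(ch)   # 1.5, i.e. noon on January 2nd, 1970
unclass(ct)   # 129600 seconds (1.5 days)
```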

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It is possible to set Excel to use either origin, which is why the word usually was employed above.
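The recipe above can be sketched on hypothetical serial numbers (the values of x here are illustrative assumptions, not real spreadsheet data):

```r
## Hedged sketch: converting hypothetical Excel (Windows) serial
## date-times, i.e. days since 1899-12-30, to R date classes
library(chron)
x <- c(38139.0, 38139.5)           # hypothetical Excel values

as.Date("1899-12-30") + floor(x)   # Date class: drops the time of day
chron("12/30/1899") + x            # chron: keeps the fractional day
```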

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
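Following the SAS origin just described, a conversion can also be sketched by hand (a hypothetical example with made-up values, not from the article):

```r
## Hedged sketch: converting hypothetical SAS datetime values
## (seconds since 1960-01-01) to POSIXct
x <- c(0, 86400)  # hypothetical SAS values: origin, and one day later
as.POSIXct(x, origin = "1960-01-01", tz = "GMT")
```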

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.

R News ISSN 1609-3631

Vol. 4/1, June 2004

orig <- chron("01/01/60")
x <- 0:9        # days since 01/01/60
orig + x        # chron dates with default origin

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone and Greenwich Mean Time (GMT) can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
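A small sketch of the convert-via-character recipe (the example instant is mine): formatting in an explicit time zone and re-parsing pins down the date, whereas a direct conversion may use a time zone you did not intend.

```r
dp <- as.POSIXct("1970-10-15 00:00:00", tz = "GMT")
# Going through character in an explicit tz is unambiguous.
as.Date(format(dp, "%Y-%m-%d", tz = "GMT"))  # "1970-10-15"
```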

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT days for POSIXct)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")

to Date
  from chron:   as.Date(dates(dc))
  from POSIXct: as.Date(format(dp))

to chron
  from Date:    chron(unclass(dt))
  from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]

to POSIXct
  from Date:    as.POSIXct(format(dt))
  from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")

to Date
  from chron:   as.Date(dates(dc))
  from POSIXct: as.Date(dp)

to chron
  from Date:    chron(unclass(dt))
  from POSIXct: as.chron(dp)

to POSIXct
  from Date:    as.POSIXct(format(dt), tz="GMT")
  from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if one hour deviation at time changes between standard and daylight savings time are acceptable.
[3] Can be reduced to dp-24*60*60 if one hour deviation at time changes between standard and daylight savings time are acceptable.
[4] Can be reduced to dp+24*60*60*x if one hour deviation acceptable when straddling odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/time suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0) for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
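A tiny worked check of that claim (the data are invented here): with perfectly separated groups the two cumulative sums are non-decreasing, so the curve is monotone.

```r
T <- c(1, 2, 3, 4)   # test results
D <- c(0, 0, 1, 1)   # disease indicator
DD <- table(-T, D)   # rows ordered by decreasing T
sens  <- cumsum(DD[, 2])/sum(DD[, 2])
mspec <- cumsum(DD[, 1])/sum(DD[, 1])
all(diff(sens) >= 0) && all(diff(mspec) >= 0)  # TRUE
```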

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
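One possible sketch of such a function (my own, not the article's solution): a trapezoid rule over the curve's points, assuming the list layout produced by the ROC() constructor above, with the curve anchored at (0,0) and (1,1).

```r
AUC <- function(x) {
  # trapezoid rule over the (1-specificity, sensitivity) points
  mspec <- c(0, x$mspec, 1)
  sens  <- c(0, x$sens, 1)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}

# e.g. a perfectly separating test gives area 1
perfect <- list(sens = c(0.5, 1, 1, 1), mspec = c(0, 0, 0.5, 1))
AUC(perfect)  # 1
```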

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
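To see a validity check in action, here is a throwaway sketch (the class name "ROCdemo" and slot subset are mine, chosen to avoid clobbering the article's class): new() succeeds for consistent slots and signals an error when the lengths disagree.

```r
library(methods)
setClass("ROCdemo",
  representation(sens = "numeric", mspec = "numeric"),
  validity = function(object)
    length(object@sens) == length(object@mspec))

ok  <- new("ROCdemo", sens = c(0.5, 1), mspec = c(0, 0.5))
bad <- tryCatch(new("ROCdemo", sens = c(0.5, 1), mspec = 0),
                error = function(e) "rejected")
bad  # "rejected"
```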

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore _ is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4 which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split = "" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


bull Model fits now add a dataClasses attributeto the terms which can be used to check thatthe variables supplied for prediction are of thesame type as those used for fitting (It is cur-rently used by predict() methods for classeslm mlm glm and ppr as well asmethods in packages MASS rpart and tree)

bull New command-line argument --max-ppsizeallows the size of the pointer protection stack tobe set higher than the previous limit of 10000

bull The fonts on an X11() device (also jpeg() andpng() on Unix) can be specified by a new ar-gument lsquofontsrsquo defaulting to the value of a newoption X11fonts

bull New functions in the tools pack-age pkgDepends getDepList andinstallFoundDepends These provide func-tionality for assessing dependencies and theavailability of them (either locally or from on-line repositories)

bull The parsed contents of a NAMESPACE file arenow stored at installation and if available usedto speed loading the package so packages withnamespaces should be reinstalled

bull Argument asp although not a graphics param-eter is accepted in the of graphics functionswithout a warning It now works as expectedin contour()

bull Package stats4 exports S4 generics for AIC()and BIC()

bull The Mac OS X version now produces an Rframework for easier linking of R into otherprograms As a result Rapp is now relocat-able

bull Added experimental support for conditionalsin NAMESPACE files

bull Added aslistenvironment to coerce envi-ronments to lists (efficiently)

bull New function addmargins() in the stats pack-age to add marginal summaries to tables egrow and column totals (Based on a contribu-tion by Bendix Carstensen)

bull dendrogam edge and node labelscan now be expressions (to be plot-ted via statsplotNode called fromplotdendrogram) The diamond framesaround edge labels are more nicely scaled hor-izontally

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added "just" argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

    "on"      = clip to the viewport (was TRUE)
    "inherit" = clip to whatever parent says (was FALSE)
    "off"     = turn off clipping

Still accept logical values (and NA maps to "off").
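A hedged sketch of the three settings (assumes an open graphics device; the sizes and colours are invented):

```r
# Illustration only: the new "off" setting disables clipping entirely,
# so drawing may fall outside the viewport; "on" clips to the viewport
# as TRUE used to.
library(grid)
grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5, clip = "on"))
grid.rect(gp = gpar(col = "grey"))
pushViewport(viewport(clip = "off"))
grid.circle(x = 1.1, r = 0.3)  # spills outside the clipped region
popViewport(2)
```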

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
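For instance (a sketch; the chosen help topic is arbitrary):

```r
# Illustration only: re-run a help page's examples with the same fixed
# RNG settings that R CMD check now uses (default RNGkind, set.seed(1)).
example("sample", setRNG = TRUE)
```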

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv = 2) and psigamma(x, deriv = 3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869 and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


version allows one to see how computations are performed and to easily write other tests, while the C version is needed for performance reasons.

The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.

> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")


Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line at the right of the histogram.

Conclusion

We have described only the most basic functions ofthe ade4 package considering only the simplest one-table data analysis methods Many other dudi meth-ods are available in ade4 for example multiple cor-respondence analysis (dudiacm) fuzzy correspon-dence analysis (dudifca) analysis of a mixture ofnumeric variables and factors (dudimix) non sym-metric correspondence analysis (dudinsc) decen-tered correspondence analysis (dudidec)

We are preparing a second paper dealing withtwo-tables coupling methods among which canon-ical correspondence analysis and redundancy analy-sis are the most frequently used in ecology (Legendreand Legendre (1998)) The ade4 package proposes analternative to these methods based on the co-inertiacriterion (Dray et al (2003))

The third category of data analysis methods available in ade4 are K-tables analysis methods, which try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al., 1994; statis and pta functions) or from the multiple co-inertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.

Bibliography

Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305-310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39-54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295-309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée : application aux peuplements ichtyologiques du Haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29-40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078-3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Developments in numerical ecology, pages 139-156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325-338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5-48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249-255.

Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197-212.

Kroonenberg, P. and Lombardo, R. (1999). Nonsymmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367-396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97-119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1-24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd English edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195-203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393-3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519-524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91-119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75-83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272-283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13-26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404-423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421-432.


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows the user to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, although written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. It can be created simply by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample, or "rational group", must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE

> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


type      Control charts for variables
xbar      X chart. Sample means are plotted to control the mean level of a continuous process variable.
xbar.one  X chart. Sample values from a one-at-a-time data process are plotted to control the mean level of a continuous process variable.
R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type      Control charts for attributes
p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
c         c chart. The number of defectives per unit is plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control", and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (the so-called "American practice") or by specifying the confidence level (the so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tail probability of 0.0027 under a standard normal curve.
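This equivalence is easy to check in base R: the two-tail probability outside ±3σ under a standard normal is

> 2 * pnorm(-3)
[1] 0.002699796

so nsigmas=3 corresponds (up to rounding) to a confidence level of 0.9973.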

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
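As a sketch, a chart with user-supplied parameters might look like the following (the numeric values are purely illustrative, and limits is assumed here to accept a vector of lower and upper limits):

> qcc(diameter[1:25,], type="xbar", center=74,
+     std.dev=0.01, limits=c(73.97, 74.03))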

[Figure 1 appears here: xbar chart for diameter[1:25, ]. Number of groups = 25, center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, number beyond limits = 0, number violating runs = 0.]

Figure 1: An X chart for samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly, as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both the calibration data and the new data (see Figure 2), but all the statistics and the control limits are based solely on the first 25 samples.

[Figure 2 appears here: xbar chart for diameter[1:25, ] and diameter[26:40, ]. Number of groups = 40, center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, number beyond limits = 3, number violating runs = 1.]

Figure 2: An X chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method; we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
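As a sketch, individual measurements could be charted by passing a vector instead of a grouped matrix (using the first column of diameter here is our own illustrative choice, not an example from the package documentation):

> qcc(diameter[,1], type="xbar.one")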

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE

> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to identify values on the plot interactively.
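Since the return value is invisible, it must be assigned to be inspected; for example (the variable name beta is our own):

> beta <- oc.curves(obj)
> beta   # type II error probabilities for several sample sizes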

[Figure 3 appears here: probability of type II error versus process shift (in std.dev units) for the xbar chart with n = 1, 5, 10, 15, 20 (left panel), and versus p for the p chart (right panel).]

Figure 3: Operating characteristic curves.

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful for detecting small and permanent variation in the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control condition; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
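For example, a cusum chart tuned to signal smaller shifts sooner might lower both settings (the values below are illustrative only, not taken from the article):

> cusum(obj, decision.int=4, se.shift=0.5)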

[Figure 4 appears here: cusum chart for diameter[1:25, ] and diameter[26:40, ]. Number of groups = 40, target = 74.00118, StdDev = 0.009887547, decision boundaries at 5 std. err., shift detection at 1 std. err., number of points beyond boundaries = 4.]

Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
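A larger value of λ gives more weight to recent observations; for example (an illustrative value):

> ewma(obj, lambda=0.4)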

R News ISSN 1609-3631

Vol 41 June 2004 15

[Figure 5 appears here: EWMA chart for diameter[1:25, ] and diameter[26:40, ]. Number of groups = 40, target = 74.00118, StdDev = 0.009887547, smoothing parameter = 0.2, control limits at 3σ.]

Figure 5: EWMA chart.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type chart, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+     spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125     Target = 74
Center = 74.00118       LSL = 73.95
StdDev = 0.009887547    USL = 74.05

Capability indices:
      Value   2.5%  97.5%
Cp    1.686  1.476  1.895
Cp_l  1.725  1.539  1.912
Cp_u  1.646  1.467  1.825
Cp_k  1.646  1.433  1.859
Cpm   1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is the density of a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
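For instance, 99% confidence intervals for the capability indices could be requested as follows (an illustrative call):

> process.capability(obj, spec.limits=c(73.95, 74.05),
+     confidence.level=0.99)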

Figure 6: Process capability analysis for diameter[1:25, ].

A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting, on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example, we may input the following code:

> process.capability.sixpack(obj,
+     spec.limits=c(73.95, 74.05),
+     target=74.01)

Pareto charts and cause-and-effectdiagrams

A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sums. Pareto charts are useful for identifying those factors that have the greatest cumulative effect on the system,


and thus for screening out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simplest form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num",
+     "part num")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).

[Figure 7 appears here: Pareto chart for defect, with frequencies for contact num, price code, supplier code, part num and schedule date, and a cumulative percentage line.]

Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fishbone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the font used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

[Figure 8 appears here: cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements, Materials, Personnel, Environment, Methods and Machines.]

Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user's point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before signaling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
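As a sketch (the option name run.length below is our assumption about the interface; check the output of qcc.options() for the names actually available):

> qcc.options()               # retrieve all options and current values
> qcc.options(run.length=7)   # set a single option (name assumed)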

Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits, and thus it may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be


called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) {
    lcl <- -conf
    ucl <- +conf
  }
  else {
    if (conf > 0 & conf < 1) {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid conf argument")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

[Figure 9 appears here: left panel, p chart for x with variable LCL and UCL (number of groups = 15, center = 0.1931429, StdDev = 0.3947641); right panel, p.std chart for x (center = 0, StdDev = 1, LCL = -3, UCL = 3); no points beyond limits and no violating runs in either chart.]

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).

Summary

In this paper we have briefly described the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000). Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991). Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
[email protected]

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebraare

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
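These rules can be illustrated on a small simulated example (the matrices X and y and their dimensions are our own; only base R functions are used):

> set.seed(1)
> X <- matrix(rnorm(100 * 5), 100, 5)
> y <- rnorm(100)
> b1 <- solve(t(X) %*% X) %*% t(X) %*% y     # naive translation of (2)
> b2 <- solve(crossprod(X), crossprod(X, y)) # rules 1 and 2
> ch <- chol(crossprod(X))                   # rule 3: Cholesky factor
> b3 <- backsolve(ch, forwardsolve(ch, crossprod(X, y),
+                 upper = TRUE, trans = TRUE))
> all.equal(as.vector(b1), as.vector(b2))
[1] TRUE
> all.equal(as.vector(b2), as.vector(b3))
[1] TRUE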

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850 712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute:

> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:

> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:

> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object, so that it can be reused without recalculation:

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix, mm, has class "cscMatrix" indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> system.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> system.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS) such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or ATLAS (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A'B are generally faster than those in AB.

Notice that this means that calculating A'B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
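As a quick sanity check of this point (a sketch, not from the article; exact timings are machine-dependent), one can verify that crossprod(A, B) agrees with t(A) %*% B and compare the two on a toy problem:

```r
## Sketch: crossprod(A, B) equals t(A) %*% B, without the
## explicit transpose that t(A) %*% B performs each call.
set.seed(1)
A <- matrix(rnorm(2000 * 100), 2000, 100)
B <- matrix(rnorm(2000 * 100), 2000, 100)

stopifnot(all.equal(crossprod(A, B), t(A) %*% B))

system.time(for (i in 1:50) t(A) %*% B)      # transposes A every time
system.time(for (i in 1:50) crossprod(A, B)) # no explicit transpose
```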

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D. M. Bates
U. of Wisconsin-Madison
[email protected]

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)
# time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
# time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation", xscale = 365,
+      ylab = "% surviving", yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928



Figure 1: Survival distributions for two lung cancer treatments.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + β′z.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+        ratetable(edtrt = edtrt, bili = bili,
+                  platelet = platelet, age = age,
+                  protime = protime),
+        data = pbc, subset = trt == -9,
+        ratetable = mayomodel, cohort = TRUE),
+       col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.


Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])


Figure 3: Testing proportional hazards.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
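As a small sketch (not from the article, reusing the veteran data introduced earlier), a strata() term gives each cell type its own baseline hazard while the age and Karnofsky-score effects are shared across strata:

```r
## Sketch: a stratified Cox model on the veteran lung cancer data.
## Each level of celltype gets its own baseline hazard function.
library(survival)
data(veteran)
fit <- coxph(Surv(time, status) ~ age + karno + strata(celltype),
             data = veteran)
fit
```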

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
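A minimal sketch of such a parametric fit (again on the veteran data, an assumption of this example; survreg fits a Weibull model by default):

```r
## Sketch: a parametric accelerated-failure-time model for the
## log of survival time, with Weibull errors (survreg's default).
library(survival)
data(veteran)
pfit <- survreg(Surv(time, status) ~ trt + karno, data = veteran)
summary(pfit)
```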

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004: The R User Conference

by John Fox, McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference, the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

John Fox
McMaster University, Canada
[email protected]

R Help Desk

Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of date/time (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications such as world currency markets where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
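The internal representations described above can be inspected directly with unclass. A small sketch (the chron part assumes the chron package is installed):

```r
## Date: days since 1970-01-01
d <- as.Date("1970-01-02")
stopifnot(unclass(d) == 1)

## POSIXct: seconds since 1970-01-01 00:00:00 GMT
dp <- as.POSIXct("1970-01-01 00:00:10", tz = "GMT")
stopifnot(unclass(dp) == 10)

## chron: days plus fraction of a day (1.5 is noon, Jan 2, 1970)
library(chron)
dc <- chron(1.5)
stopifnot(as.numeric(dc) == 1.5)
```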

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
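A small sketch of the Windows-Excel conversion described above (the example serial number is an assumption of this sketch: in Excel's usual 1900 date system, serial 36526 falls on 2000-01-01):

```r
## Convert a Windows-Excel serial date/time to a Date.
x <- 36526.5                        # noon on 2000-01-01 in Excel
d <- as.Date("1899-12-30") + floor(x)
format(d)                           # "2000-01-01"
```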

SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and date/times are in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent date/times. zoo can use any date or time/date package to represent date/times.

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9 # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in Table 1 of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
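The character-first recipe in the last point can be sketched as follows (assuming the chron package is installed; the conversion idiom is the one given in the comparison table):

```r
## Convert POSIXct to chron by way of character strings, so that
## no implicit time-zone arithmetic is involved in the conversion.
library(chron)
dp <- as.POSIXct("2004-06-15 12:30:00", tz = "GMT")
dc <- chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))
dc
```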

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison of date/time idioms in the Date, chron and POSIXct classes. (In the table, dt, dc and dp denote Date, chron and POSIXct variables, dt0, dc0 and dp0 denote starting values, x is a numeric vector and z a character vector of dates.)

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))  # GMT

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day")

time zone difference
  POSIXct: dp-as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10))

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3)%%7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(format(dp))
  to chron:   from Date: chron(unclass(dt)); from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))
  to POSIXct: from Date: as.POSIXct(format(dt)); from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc)%%1))

Conversion (tz="GMT")
  to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(dp)
  to chron:   from Date: chron(unclass(dt)); from POSIXct: as.chron(dp)
  to POSIXct: from Date: as.POSIXct(format(dt), tz="GMT"); from chron: as.POSIXct(dc)

isa

vect

orof

char

acte

rsx

isa

vect

orof

num

bers

Var

iabl

esbe

ginn

ing

wit

hdt

are

Dat

eva

riab

les

Var

iabl

esbe

ginn

ing

wit

hdc

are

chro

nva

riab

les

and

vari

able

sbe

ginn

ing

wit

hdp

are

POSI

Xct

vari

able

sV

aria

bles

endi

ngin

0ar

esc

alar

sot

hers

are

vect

ors

Expr

essi

ons

invo

lvin

gch

ron

date

sas

sum

eal

lchr

onop

tion

sar

ese

tatd

efau

ltva

lues

See

strp

tim

efo

rm

ore

co

des

1 Can

bere

duce

dto

dp1-dp2

ifit

sac

cept

able

that

date

tim

escl

oser

than

one

day

are

retu

rned

inho

urs

orot

her

unit

sra

ther

than

days

2 C

anbe

redu

ced

todp+246060

ifon

eho

urde

viat

ion

atti

me

chan

ges

betw

een

stan

dard

and

dayl

ight

savi

ngs

tim

ear

eac

cept

able

3 C

anbe

redu

ced

todp-246060

ifon

eho

urde

viat

ion

atti

me

chan

ges

betw

een

stan

dard

and

dayl

ight

savi

ngs

tim

ear

eac

cept

able

4 C

anbe

redu

ced

todp+246060x

ifon

eho

urde

viat

ion

acce

ptab

lew

hen

stra

ddlin

god

dnu

mbe

rof

stan

dard

day

light

savi

ngs

tim

ech

ange

s5 C

anbe

redu

ced

toplot(dpsrnorm(10))

ifpl

otti

ngpo

ints

atG

MT

date

tim

esra

ther

than

curr

entd

atet

ime

suffi

ces

6 An

equi

vale

ntal

tern

ativ

eisaschron(asPOSIXct(format(dp)tz=GMT))

Table 1 Comparison Table


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 − specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) and controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c])/sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T == TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
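As a quick sanity check (the simulated data below are invented for illustration, and the function definition is repeated from the text so the snippet runs on its own), the curve for a test where cases score higher should lie well above the diagonal:

```r
# Simulated diagnostic data: cases (D = 1) score higher on average.
set.seed(1)
D <- rep(0:1, each = 50)
T <- rnorm(100, mean = 2 * D)

drawROC <- function(T, D) {   # as defined in the text
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
drawROC(T, D)   # the curve sits well above the diagonal
```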

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
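This duck-typing convention can be made concrete with a toy example (the class and generic names below are invented for illustration):

```r
# Invented example: R trusts the class label and never verifies it.
quack <- function(x, ...) UseMethod("quack")
quack.default <- function(x, ...) "silence"
quack.duck <- function(x, ...) "quack!"

d <- list()
class(d) <- "duck"   # nothing checks that d has duck-like contents
quack(d)             # dispatches to quack.duck
```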

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
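One possible approach to that exercise (a sketch, not the article's solution, and assuming x has the sens and mspec components built above) is trapezoidal integration of the empirical curve:

```r
# Hypothetical AUC sketch: trapezoidal rule over the empirical
# (1-specificity, sensitivity) points, padded to span [0, 1].
AUC <- function(x) {
  fpr <- c(0, x$mspec, 1)
  tpr <- c(0, x$sens, 1)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1))/2)
}
```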

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
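With the class and constructor in place, the validity function can be seen rejecting a malformed object (a sketch with made-up slot values; the setClass call is repeated from the text so the snippet runs on its own):

```r
library(methods)

# Repeated from the text so this snippet is self-contained.
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

# Matching lengths pass the validity check...
ok <- new("ROC", sens = c(.5, 1), mspec = c(.2, 1),
          test = c(2, 1), call = quote(ROC(T, D)))

# ...mismatched lengths do not: new() signals an error.
bad <- try(new("ROC", sens = 1, mspec = c(.2, 1),
               test = 1, call = quote(ROC(T, D))),
           silent = TRUE)
stopifnot(inherits(bad, "try-error"))
```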

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a missing second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument 'pt.cex'.

• plot.ts() has more arguments, particularly 'yax.flip'.

• heatmap() has a new 'keep.dendro' argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and '[' methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The 'package' argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument 'package' which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix, as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments 'etastart' and 'mustart' are now evaluated via the model frame, in the same way as 'subset' and 'weights'.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a 'model' argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both 'x' and 'covmat' are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both 'each' and 'length.out' have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both 'times' and 'length.out' have been specified, 'times' is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments and split="" is optimized.

• subset() now allows a 'drop' argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• "dendrogram" edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  – Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  – Added 'just' argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A",
                gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"),
                gp=gpar(fill="blue"))


  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their 'vp' slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport 'clip' argument are now:

      "on"      clip to the viewport (was TRUE)
      "inherit" clip to whatever parent says (was FALSE)
      "off"     turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).
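The equivalence can be checked directly in R: psigamma(x, deriv = n) gives the n-th derivative of the digamma function, so deriv = 0 and deriv = 1 reproduce digamma() and trigamma() in the same way that deriv = 2 and deriv = 3 replace the deprecated functions:

```r
x <- c(0.5, 1, 2.5)

# deriv = 0 and deriv = 1 agree with the dedicated functions
stopifnot(all.equal(psigamma(x, deriv = 0), digamma(x)))
stopifnot(all.equal(psigamma(x, deriv = 1), trigamma(x)))

# deriv = 2 and deriv = 3 are the replacements for
# tetragamma() and pentagamma()
psigamma(x, deriv = 2)
```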

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas='goto' to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Grün.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., there is one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/





qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows the user to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important and R was a natural candidate.

The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of this last environment will find some similarities but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size:  5
Number of groups:  25
Center of group statistics:  74.00118
Standard deviation:  0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444


Table 1: Shewhart control charts available in the qcc package.

Control charts for variables:

  "xbar"     X chart. Sample means are plotted to control the mean level of a continuous process variable.

  "xbar.one" X chart. Sample values from a one-at-a-time data process to control the mean level of a continuous process variable.

  "R"        R chart. Sample ranges are plotted to control the variability of a continuous process variable.

  "S"        S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes:

  "p"        p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

  "np"       np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.

  "c"        c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.

  "u"        u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart this chart does not require a constant number of units.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X̄ chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so called "American practice") or specifying the confidence level (so called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.
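This equivalence is easy to verify directly in base R (a quick illustration, not part of the qcc package):

```r
> 2 * pnorm(-3)          # two-tails probability beyond +/- 3 sigma
[1] 0.002699796
> pnorm(3) - pnorm(-3)   # the equivalent confidence level
[1] 0.9973002
```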

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000); Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
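For instance, a chart based on externally supplied process parameters might be requested as follows (the numerical values below are made up purely for illustration):

```r
> qcc(diameter[1:25,], type="xbar", center=74,
+     std.dev=0.01, limits=c(73.97, 74.03))
```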

[X̄ chart for diameter[1:25, ]: group summary statistics plotted against group number, with center line, LCL = 73.98791 and UCL = 74.01444. Number of groups = 25, Center = 74.00118, StdDev = 0.009887547, Number beyond limits = 0, Number violating runs = 0.]

Figure 1: An X̄ chart for the samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can however be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X̄ chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

[X̄ chart for diameter[1:25, ] and diameter[26:40, ]: calibration data in diameter[1:25, ], new data in diameter[26:40, ]. Number of groups = 40, Center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, Number beyond limits = 3, Number violating runs = 1.]

Figure 2: An X̄ chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
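As an illustrative sketch (not from the original example), the individual measurements in the first column of the calibration samples could be charted one at a time as:

```r
> qcc(diameter[1:25, 1], type="xbar.one")
```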

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)


> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE

> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.

[Left panel: OC curves for the xbar chart, probability of type II error against process shift (std.dev), for sample sizes n = 1, 5, 10, 15, 20. Right panel: OC curves for the p chart, probability of type II error against p.]

Figure 3: Operating characteristic curves.

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out-of-control signal; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
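As a sketch, a cusum chart with a tighter decision interval than the default could be requested as follows (the value 4 is illustrative only):

```r
> cusum(obj, decision.int=4, se.shift=1)
```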

[Cusum chart for diameter[1:25, ] and diameter[26:40, ]: cumulative sums above and below the target, with lower and upper decision boundaries (LDB, UDB). Calibration data in diameter[1:25, ], new data in diameter[26:40, ]. Number of groups = 40, Target = 74.00118, StdDev = 0.009887547; decision boundaries (std. err.) = 5, shift detection (std. err.) = 1, number of points beyond boundaries = 4.]

Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, also this plot can be used to detect small and permanent variation on the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
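For instance, a chart giving more weight to recent observations could be obtained with a larger smoothing parameter (the value 0.4 is illustrative only):

```r
> ewma(obj, lambda=0.4)
```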


[EWMA chart for diameter[1:25, ] and diameter[26:40, ]: smoothed group summary statistics. Calibration data in diameter[1:25, ], new data in diameter[26:40, ]. Number of groups = 40, Target = 74.00118, StdDev = 0.009887547; smoothing parameter = 0.2, control limits at 3*sigma.]

Figure 5: EWMA chart.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for a "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
                   spec.limits = c(73.95, 74.05))

Number of obs = 125       Target = 74
Center = 74.00118         LSL = 73.95
StdDev = 0.009887547      USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed on the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).

[Process capability analysis for diameter[1:25, ]: histogram of data values with LSL, Target and USL lines and a matching normal density. Number of obs = 125, Center = 74.00118, StdDev = 0.009887547; Target = 74, LSL = 73.95, USL = 74.05; Cp = 1.69, Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67; Exp<LSL 0%, Exp>USL 0%, Obs<LSL 0%, Obs>USL 0%.]

Figure 6: Process capability analysis.

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X̄ chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95, 74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,


and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).

[Pareto chart for defect: bars for contact num., price code, supplier code, part num. and schedule date in decreasing frequency, with a cumulative percentage line.]

Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

[Cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).]

Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
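As a sketch, the option list can be inspected and a single option changed as follows (the option name run.length is inferred from the discussion above and may differ between package versions; check qcc.options() for the names your version uses):

```r
> qcc.options()              # list all options and their current values
> qcc.options(run.length=7)  # lengthen a run before signalling out of control
```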

Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case); another one which returns the within-group standard deviation (one in this case); and finally a function which returns the control limits. These functions need to be


called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar) /
       sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev,
                         sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then, we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

[Left panel: p chart for x with variable control limits. Number of groups = 15, Center = 0.1931429, StdDev = 0.3947641, LCL and UCL variable, number beyond limits = 0, number violating runs = 0. Right panel: p.std chart for x. Number of groups = 15, Center = 0, StdDev = 1, LCL = -3, UCL = 3, number beyond limits = 0, number violating runs = 0.]

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
[email protected]

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²                (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is


    β̂ = (X′X)⁻¹ X′y                        (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
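The first two principles can be illustrated with a small self-contained sketch (the matrices here are randomly generated for demonstration only):

```r
> set.seed(1)
> X <- matrix(rnorm(100 * 5), 100, 5)
> y <- rnorm(100)
> A <- crossprod(X)       # X'X, the same value as t(X) %*% X
> b <- crossprod(X, y)    # X'y
> ## inverting A works, but solving the system directly is
> ## faster and numerically more stable
> all.equal(solve(A) %*% b, solve(A, b))
[1] TRUE
```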

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sysgc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sysgc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sysgc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sysgc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:

> sysgc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sysgc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object, so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix


product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
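The point is easy to check on any machine; the following small sketch (with randomly generated matrices, not the mm data above) confirms that the two expressions give the same result while crossprod avoids the explicit transposition:

```r
> A <- matrix(rnorm(1000 * 200), 1000, 200)
> B <- matrix(rnorm(1000 * 100), 1000, 100)
> all.equal(t(A) %*% B, crossprod(A, B))
[1] TRUE
> system.time(for(i in 1:20) t(A) %*% B)      # transposes A each time
> system.time(for(i in 1:20) crossprod(A, B)) # usually faster
```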

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
[email protected]

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcl/tk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcl/tk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When

vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30 + time,
                     status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving",
       yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928

R News ISSN 1609-3631

Vol 41 June 2004 27

[Figure 1 shows two survival curves, labelled Standard and New, with "Years since randomisation" (0 to 2.5) on the x axis and "% surviving" (0 to 100) on the y axis.]

Figure 1: Survival distributions for two lung cancer treatments

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    h(t; z) = h0(t) exp(βz)

or, equivalently, log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at that time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age +
    platelet, data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
                ratetable(edtrt = edtrt,
                          bili = bili,
                          platelet = platelet,
                          age = age,
                          protime = protime),
                data = pbc, subset = trt == -9,
                ratetable = mayomodel,
                cohort = TRUE),
        col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

[Figure 2 shows observed and predicted survival curves, with follow-up time (0 to 4000 days) on the x axis and survival probability (0.0 to 1.0) on the y axis.]

Figure 2: Observed and predicted survival

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure 3 plots Beta(t) for edtrt against time (x axis from 210 to 3400), with a smoothed estimate of the coefficient over follow-up.]

Figure 3: Testing proportional hazards

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004: The R User Conference

by John Fox, McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk

Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than datetime packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
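For readers without Hmisc at hand, the same conversions can be written directly by supplying the appropriate origin to as.POSIXct; the numeric vectors below are invented for illustration and are not from the original article:

```r
## hypothetical SAS datetimes: seconds since 1960-01-01
sas.datetimes <- c(0, 86400)
as.POSIXct(sas.datetimes, origin = "1960-01-01", tz = "GMT")

## hypothetical SPSS datetimes: seconds since 1582-10-14
spss.datetimes <- c(0, 86400)
as.POSIXct(spss.datetimes, origin = "1582-10-14", tz = "GMT")
```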

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
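As a brief illustration of the regular and irregular cases (an addition, not part of the original article):

```r
## regular monthly series: labelled by start and frequency, not by dates
x <- ts(rnorm(24), start = c(2003, 1), frequency = 12)

## irregular series indexed by Date objects, using the zoo package
library(zoo)
z <- zoo(rnorm(3),
         as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
```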

Input

By default, read.table will read in numeric data such as 20040115 as numbers, and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table [1].

## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

## default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

## if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  ## days since 01/01/60

## chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  ## GMT days

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")  (1)

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]  (2)

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]  (3)

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  (4)

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  (5)

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

to chron
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  (6)

to POSIXct
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

to chron
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

to POSIXct
  Date:    as.POSIXct(format(dt), tz = "GMT")
  chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for more % codes.

Table footnotes:
(1) Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
(2) Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
(3) Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
(4) Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
(5) Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
(6) An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).


Programmers' Niche

A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
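A quick way to exercise either version of drawROC is with simulated data in which the test is genuinely higher among cases; this check is an addition for illustration, not part of the original article:

```r
set.seed(42)
D <- rep(0:1, each = 100)   # disease indicator: 100 controls, 100 cases
T <- rnorm(200, mean = D)   # test values shifted upwards for cases
drawROC(T, D)               # the curve should lie above the diagonal
```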

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
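One possible solution to the exercise, using the trapezoidal rule on the points stored in the object (a sketch, not the author's intended solution):

```r
AUC <- function(x) {
  ## prepend the implicit (0,0) corner, then apply the trapezoidal rule;
  ## x$mspec is non-decreasing, so the strip widths are non-negative
  sens  <- c(0, x$sens)
  mspec <- c(0, x$mspec)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```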

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster, in the Design package, is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data; optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
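A sketch of that stricter validity check (my assumption, not the article's code; the class name "ROCstrict" is hypothetical), which also requires both rates to be non-decreasing values in [0, 1]:

```r
## Stricter variant of the ROC class validity check (hypothetical sketch):
## lengths must agree, rates must lie in [0, 1] and be non-decreasing.
library(methods)
setClass("ROCstrict",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    if (length(object@sens) != length(object@mspec))
      return("sens and mspec lengths differ")
    if (any(object@sens < 0 | object@sens > 1) ||
        any(object@mspec < 0 | object@mspec > 1))
      return("rates must lie in [0, 1]")
    if (is.unsorted(object@sens) || is.unsorted(object@mspec))
      return("rates must be non-decreasing")
    TRUE
  })
```

An invalid object is rejected by new() at creation time, which is exactly the point of registering the check.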

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new

R News ISSN 1609-3631

Vol 41 June 2004 35

function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity", ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x) {
    lines(x@mspec, x@sens)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object:

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance: it is not that a generalised linear model is a linear model, more that it has a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or better set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
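For example (a quick sketch, not from the NEWS file):

```r
## Environments can now be indexed like lists: character names only,
## no partial matching, and lookups do not search enclosing frames.
e <- new.env()
e$alpha <- 1            # equivalent to assign("alpha", 1, envir = e)
e[["beta"]] <- 2
val <- e[["alpha"]] + e$beta
```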

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
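A short sketch of the new class in use (not part of the NEWS entry):

```r
## "Date" objects store dates without times; arithmetic works in days.
d <- as.Date("2004-06-01")
later <- d + 30                                  # 30 days on
gap <- as.numeric(as.Date("2004-06-11") - d)     # difference in days
```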

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.
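For instance (a sketch, not from the NEWS file), the Minkowski metric with p = 2 reduces to the Euclidean distance and with p = 1 to the Manhattan distance:

```r
## Minkowski distances between the rows of a matrix.
x <- rbind(c(0, 0), c(3, 4))
d2 <- dist(x, method = "minkowski", p = 2)   # Euclidean: sqrt(9 + 16) = 5
d1 <- dist(x, method = "minkowski", p = 1)   # Manhattan: 3 + 4 = 7
```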

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() (defined as gamma(x+1)) and, for S-PLUS compatibility, lfactorial() (defined as lgamma(x+1)).
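As a quick check of those definitions (a sketch, not from the NEWS file):

```r
## factorial(x) is gamma(x + 1); lfactorial(x) is its natural log.
f5 <- factorial(5)     # 5! = 120
lf5 <- lfactorial(5)   # log(120), computed stably via lgamma
```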

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
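The coercion means numeric arguments now behave as their character equivalents (a sketch, not from the NEWS file):

```r
## Non-character arguments are coerced with as.character() rather than
## raising an error.
idx <- grep("1", c(10, 11, 20))   # as if c("10", "11", "20")
out <- sub(0, "x", 103)           # pattern and x coerced: "103" -> "1x3"
```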

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
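These return the first or last part of an object (a sketch, not from the NEWS file):

```r
## head() and tail() work on vectors (and data frames, matrices, ...).
h <- head(1:10, 3)        # first three elements
t3 <- tail(letters, 3)    # last three letters
```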

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
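For example (a sketch, not from the NEWS file):

```r
## mget() looks up several names at once and returns a named list.
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
vals <- mget(c("a", "b"), envir = e)
```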

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
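The S-compatible behaviour can be seen directly (a sketch against current R, not from the NEWS file):

```r
## The rep() compatibility changes described above.
r1 <- rep(numeric(0), length.out = 3)      # length 3 (NA-filled), not 0
r2 <- rep(1:2, each = 2, length.out = 5)   # recycles: 1 1 2 2 1
r3 <- rep(1:2, times = 3, length.out = 4)  # times ignored: 1 2 1 2
```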

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
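For example (a sketch, not from the NEWS file), with the quoting style made explicit so the result does not depend on the platform default:

```r
## shQuote() wraps a string for a shell: Bourne-style single quotes,
## or cmd.exe-style double quotes.
q1 <- shQuote("hello world", type = "sh")
q2 <- shQuote("abc", type = "cmd")
```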

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
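For example (a sketch, not from the NEWS file):

```r
## fixed = TRUE treats the split string literally instead of as a regexp;
## split = "" (the optimized case) splits into single characters.
parts <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
chars <- strsplit("abc", "")[[1]]
```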

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
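For example (a sketch, not from the NEWS file):

```r
## Coercing an environment to a list captures its bindings by name.
e <- new.env()
e$a <- 1
e$b <- "two"
lst <- as.list(e)
```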

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
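For example (a sketch, not from the NEWS file), the default summary function is sum, so margins are row and column totals:

```r
## addmargins() appends a "Sum" row and column to a two-way table.
tab <- table(gp = c("a", "a", "b"), resp = c("x", "y", "y"))
m <- addmargins(tab)
```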

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping

Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit model via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.

R News ISSN 1609-3631

Vol. 4/1, June 2004

kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik

klaR Miscellaneous functions for classification and visualization, developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and the backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel

regress Functions to fit Gaussian linear models by maximizing the residual log-likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state and steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, p. 15869, and J. Comput. Chem., 2003, 24, p. 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, and all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


qcc: An R package for quality control charting and statistical process control

by Luca Scrucca

Introduction

The qcc package for the R statistical environment allows the user to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.

I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.

Creating a qcc object

A qcc object is the basic building block to start with. It can be created by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
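For intuition, the effect of such grouping on equally sized samples can be mimicked in base R; the measurements and sample indicator below are made up for illustration and are not the package's internal code:

```r
# Hypothetical data: 10 measurements forming 5 rational groups of size 2
x <- c(74.03, 74.00, 73.99, 74.01, 74.02,
       73.98, 74.00, 74.02, 73.99, 74.01)
sample.id <- rep(1:5, each = 2)   # sample (rational group) indicator

# One row per rational group, as qcc expects for continuous data
grouped <- do.call(rbind, split(x, sample.id))
dim(grouped)   # 5 groups of size 2
```

With unequal sample sizes split() still works, but rbind() would recycle values, which is one reason a dedicated helper such as qcc.groups is preferable.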

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:

> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020

Using the first 25 samples as training data, an X̄ chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):

> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
     LCL      UCL
73.98791 74.01444


Control charts for variables:

  xbar      X̄ chart. Sample means are plotted to control the mean level of a continuous process variable.

  xbar.one  X chart. Sample values from a one-at-time data process are plotted to control the mean level of a continuous process variable.

  R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.

  S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes:

  p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

  np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.

  c         c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.

  u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X̄ chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control", and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (the so-called "American practice"), or by specifying the confidence level (the so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
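As a check on the numbers reported by summary(obj), the default ±3σ limits of an X̄ chart can be reproduced by hand in base R. This is a sketch using the center and within-group standard deviation printed above, not the package's internal code:

```r
center  <- 74.00118       # center of group statistics, from summary(obj)
std.dev <- 0.009887547    # within-group standard deviation
n       <- 5              # group sample size
nsigmas <- 3

# Limits at +/- 3 standard errors of the group mean
se  <- std.dev / sqrt(n)
lcl <- center - nsigmas * se
ucl <- center + nsigmas * se
round(c(LCL = lcl, UCL = ucl), 5)   # close to the reported 73.98791, 74.01444

# Equivalence of the two practices: +/- 3 sigma corresponds to a
# two-tails probability of 2 * pnorm(-3), roughly 0.0027
2 * pnorm(-3)
```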

Figure 1: An X̄ chart for the samples data (number of groups = 25, center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, number beyond limits = 0, number violating runs = 0).

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly, as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X̄ chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

Figure 2: An X̄ chart with calibration data and new data samples (number of groups = 40, center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, number beyond limits = 3, number violating runs = 1).

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported on Table 1, an X chart for one-at-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
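The moving-range estimate just mentioned can be sketched in base R; the data below are simulated for illustration, and d2 = 1.128 is the usual unbiasing constant for ranges of 2 points:

```r
set.seed(1)
x <- rnorm(50, mean = 74, sd = 0.01)   # simulated one-at-time measurements

# Moving ranges of k = 2 consecutive points
mr <- abs(diff(x))

# Estimate sigma as the average moving range divided by d2 (1.128 for k = 2)
sd.hat <- mean(mr) / 1.128
sd.hat   # should be near the simulated value 0.01
```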

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
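For an X̄ chart with ±3σ limits the type II error has a closed form, which is essentially what the OC curve plots against the shift: the probability that the group mean remains within the limits after the process mean moves by delta standard deviations. A base-R sketch, not the package's implementation:

```r
# P(type II error) for an xbar chart with 3-sigma limits,
# shift of delta std.dev, and group sample size n
beta <- function(delta, n = 5)
  pnorm(3 - delta * sqrt(n)) - pnorm(-3 - delta * sqrt(n))

beta(0)   # no shift: about 0.9973, the complement of the false-alarm rate
beta(1)   # a 1-sigma shift with n = 5 is missed much less often
curve(beta(x), from = 0, to = 5,
      xlab = "Process shift (std.dev)", ylab = "Prob. type II error")
```

Larger samples push the whole curve down, which is why oc.curves draws one curve per sample size.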

Figure 3: Operating characteristic curves. The left panel shows OC curves for the X̄ chart (probability of type II error against the process shift in std.dev units, for n = 1, 5, 10, 15, 20); the right panel shows OC curves for the p chart.

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful for detecting small and permanent variation in the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
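The tabular cusum behind such a chart is a simple recursion. The sketch below is illustrative rather than the package's code: standardized statistics are accumulated above and below the target with allowance k = se.shift/2, and points beyond the decision interval h = decision.int signal an alarm.

```r
set.seed(123)
z <- rnorm(30)   # standardized group summary statistics (simulated)
k <- 0.5         # allowance: half the shift to be detected (se.shift / 2)
h <- 5           # decision interval (decision.int)

pos <- neg <- numeric(length(z))
for (i in seq_along(z)) {
  pos.prev <- if (i > 1) pos[i - 1] else 0
  neg.prev <- if (i > 1) neg[i - 1] else 0
  pos[i] <- max(0, pos.prev + z[i] - k)   # cumulative sum above target
  neg[i] <- max(0, neg.prev - z[i] - k)   # cumulative sum below target
}
which(pos > h | neg > h)   # indices of points beyond the decision boundaries
```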

Figure 4: Cusum chart for the calibration and new data (number of groups = 40, target = 74.00118, StdDev = 0.009887547, decision boundaries at 5 std.err., shift detection at 1 std.err., number of points beyond boundaries = 4).

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. Like the cusum chart, this plot can be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
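The smoothing itself is a one-line recursion, sketched below in base R; lambda = 0.2 matches the package default, and the starting value is taken to be the target (an illustrative helper, not the package's function):

```r
ewma.smooth <- function(x, lambda = 0.2, start = x[1]) {
  z <- numeric(length(x))
  prev <- start
  for (i in seq_along(x)) {
    # each smoothed value mixes the new observation with the previous one
    z[i] <- lambda * x[i] + (1 - lambda) * prev
    prev <- z[i]
  }
  z
}

x <- c(74.01, 74.00, 73.99, 74.02, 74.00)   # made-up group means
round(ewma.smooth(x, lambda = 0.2, start = 74), 4)
```

A small lambda gives long memory (good for small shifts); lambda = 1 reproduces the raw data, i.e. an ordinary Shewhart chart.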


Figure 5: EWMA chart for the calibration and new data (number of groups = 40, target = 74.00118, StdDev = 0.009887547, smoothing parameter = 0.2, control limits at 3 sigma).

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' of the "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+                    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
                   spec.limits = c(73.95, 74.05))

Number of obs = 125       Target = 74
Center = 74.00118         LSL = 73.95
StdDev = 0.009887547      USL = 74.05

Capability indices:

      Value   2.5%  97.5%
Cp    1.686  1.476  1.895
Cp_l  1.725  1.539  1.912
Cp_u  1.646  1.467  1.825
Cp_k  1.646  1.433  1.859
Cpm   1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of the data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
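The basic indices are simple functions of the specification limits, the center and the within-group standard deviation; a sketch using the values printed above (the confidence intervals require additional distribution theory and are omitted here):

```r
LSL <- 73.95; USL <- 74.05      # specification limits
center  <- 74.00118             # process center
std.dev <- 0.009887547          # within-group standard deviation

Cp   <- (USL - LSL) / (6 * std.dev)       # potential capability
Cp_u <- (USL - center) / (3 * std.dev)    # capability against the upper limit
Cp_l <- (center - LSL) / (3 * std.dev)    # capability against the lower limit
Cp_k <- min(Cp_u, Cp_l)                   # actual capability

round(c(Cp = Cp, Cp_l = Cp_l, Cp_u = Cp_u, Cp_k = Cp_k), 3)
```

The results agree with the Value column of the table above, up to rounding of the inputs.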

Figure 6: Process capability analysis for diameter[1:25, ]: a histogram of the data with LSL = 73.95, USL = 74.05 and target = 74, and the capability indices Cp = 1.69, Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67.

A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X̄ chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example, we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
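The ordering and cumulative percentages behind that table are easy to reproduce in base R from the same defect counts; sorting in non-increasing order and accumulating gives the cumulative curve drawn on the chart:

```r
defect <- c("price code" = 80, "schedule date" = 27,
            "supplier code" = 66, "contact num." = 94,
            "part num." = 33)

sorted   <- sort(defect, decreasing = TRUE)   # non-increasing frequencies
cum.freq <- cumsum(sorted)                    # cumulative frequencies
cum.pct  <- 100 * cum.freq / sum(sorted)      # cumulative percentages

cbind(Frequency = sorted, Cum.Freq. = cum.freq,
      Cum.Percent = round(cum.pct, 2))
```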

Figure 7: Pareto chart for defect.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8:

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

Figure 8: Cause-and-effect diagram for the effect "Surface Flaws".

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user's point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before a point is signalled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title, and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type, and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) { lcl <- -conf
                   ucl <- +conf }
  else
  { if (conf > 0 & conf < 1)
    { nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas }
    else stop("invalid conf argument.")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel). Left panel: 15 groups, center = 0.1931429, std. dev. = 0.3947641, variable LCL and UCL. Right panel: center = 0, std. dev. = 1, LCL = -3, UCL = 3. Neither chart shows points beyond the limits or violating runs.

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute, and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts, and cause-and-effect diagrams.


Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²        (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹X′y        (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed, (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
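The first two principles can be checked directly in base R; the matrices here are our own small example, not data from the article:

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)  # a 100 x 5 model matrix
y <- rnorm(100)

A <- crossprod(X)     # X'X, computed exploiting symmetry
b <- crossprod(X, y)  # X'y

beta1 <- solve(A, b)     # principle 1: solve the system directly
beta2 <- solve(A) %*% b  # same answer, but slower and less stable

stopifnot(all.equal(A, t(X) %*% X),  # principle 2
          all.equal(b, t(X) %*% y),
          all.equal(beta1, beta2))
```

For a well-conditioned X the two solutions agree to machine precision; the difference shows up in speed, and in accuracy when X is ill-conditioned.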

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sysgc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sysgc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sysgc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers but, in most cases, the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sysgc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:

> sysgc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sysgc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times:

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
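That equivalence is easy to verify in base R (the matrices below are our own illustration):

```r
set.seed(42)
A <- matrix(rnorm(300 * 4), nrow = 300)
B <- matrix(rnorm(300 * 2), nrow = 300)

# crossprod computes A'B without ever materializing t(A)
stopifnot(all.equal(crossprod(A, B), t(A) %*% B),
          all.equal(crossprod(A), t(A) %*% A))
```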

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcl/tk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcl/tk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from display by the widget. If nothing is provided by the user, the default behavior is to open the first package in the library of locally installed R packages (.libPaths()), with subdirectories/files named Meta and latex excluded.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.

Figure 1: A snapshot of the widget when pExplorer is called.

• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files containing example code, stored in the R-ex subdirectory, in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.

Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").

Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().

Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.

The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone, it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time and 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)
# time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
# time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+     diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]

The first "Surv" object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and from the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+     xlab = "Years since randomisation",
+     xscale = 365, ylab = "% surviving", yscale = 100,
+     col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (Standard and New): percent surviving versus years since randomisation.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t, z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t, z) = log h0(t) + β′z.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+       log(bili) + log(protime) + age + platelet,
+       data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
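As a sanity check on that output, the exp(coef) column is simply the exponentiated coefficient, i.e. the multiplicative effect of a one-unit increase in the predictor on the hazard. For example (coefficients for edtrt and log(bili) as reported above):

```r
# hazard ratio for edtrt: exp(beta)
stopifnot(abs(exp(1.02980) - 2.800) < 0.001)

# a unit increase in log(bili) multiplies the hazard by about 2.59
stopifnot(abs(exp(0.95100) - 2.588) < 0.001)
```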

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
+     data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+     ratetable(edtrt = edtrt, bili = bili,
+       platelet = platelet, age = age,
+       protime = protime),
+     data = pbc, subset = trt == -9,
+     ratetable = mayomodel, cohort = TRUE),
+     col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards: the smoothed estimate of Beta(t) for edtrt plotted against time.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004: The R User Conference

by John Fox, McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference - useR! 2004 - which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations - from bioinformatics to user interfaces, and finance to fights among lizards - reflects the wide range of applications supported by R.

A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of the other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk

Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two.

R News ISSN 1609-3631

Vol. 4/1, June 2004

POSIXt itself is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
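The internal representations described above can be checked directly with unclass(). The following small sketch (not from the original article) uses base R only, since chron must be installed separately:

```r
## Inspect the internals of the date-time classes (base R only).
d <- as.Date("1970-01-02")                  # Date: days since 1970-01-01
p <- as.POSIXct("1970-01-02 00:00:00",
                tz = "GMT")                 # POSIXct: seconds since the epoch, GMT
unclass(d)                                  # 1
unclass(p)                                  # 86400
## POSIXlt is a list of components (sec, min, hour, mday, mon, year, ...)
names(unclass(as.POSIXlt(p)))
```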

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full-page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It is possible to set Excel to use either origin, which is why the word "usually" was employed above.
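As a concrete sketch of the Windows spreadsheet conversion just described (the serial number 36526 used here is an illustrative assumption; it is the usual Windows-Excel code for 2000-01-01):

```r
## Convert a Windows-Excel date serial number to Date class,
## using the December 30, 1899 origin described above.
x <- 36526                            # Excel (Windows) serial number
as.Date("1899-12-30") + floor(x)      # "2000-01-01"
```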

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.
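To illustrate the point that ts labels time numerically rather than with dates, here is a minimal sketch (not from the article):

```r
## A regular monthly series: ts labels time on a numeric scale
## (year + fraction of year), not with a date class.
z <- ts(rnorm(24), start = c(2004, 1), frequency = 12)
start(z)           # first time point: year 2004, period 1
frequency(z)       # 12
time(z)[2]         # February 2004 as the plain number 2004 + 1/12
```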

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion, there is really no reason to have to resort to non-standard origins. For example:

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone and Greenwich Mean Time (GMT) can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
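The last two caveats can be sketched in a few lines; this example (not from the article) uses GMT throughout so that the result does not depend on the local time zone:

```r
## Checking tz= behaviour: format() honours its tz= argument here.
p <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")
format(p, tz = "GMT")        # "2004-06-01 12:00:00"

## Converting via character, as recommended above, yields a POSIXlt
## object with the expected standard/daylight-savings flag.
lt <- as.POSIXlt(format(p, tz = "GMT"), tz = "GMT")
lt$isdst                     # 0 (GMT has no daylight savings time)
```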

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


Table 1: Comparison table. For each task, the idiom is given for the Date, chron and POSIXct classes in turn.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")   [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]   [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]   [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]   [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))   [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month (which sorts)
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output as yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output as mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output as Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    as.Date(dates(dc)); as.Date(format(dp))
  to chron:   chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))   [6]
  to POSIXct: as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    as.Date(dates(dc)); as.Date(dp)
  to chron:   chron(unclass(dt)); as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz = "GMT"); as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for the % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison table.


Programmers' Niche

A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined:

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
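For readers who want to check their answer, one possible sketch (mine, not the article's) applies the trapezoidal rule to the stored coordinates, closing the curve at (0, 0):

```r
## Trapezoidal-rule AUC for an object with $sens and $mspec
## components as constructed by the ROC() function above.
AUC <- function(x) {
  sens  <- c(0, x$sens)
  mspec <- c(0, x$mspec)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```

With perfectly separated groups the curve passes through (0, 1) and the area is 1; a useless test gives an area near 0.5.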

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call:

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits = FALSE.
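A short sketch (not part of the original announcement) of these new environment accessors:

```r
## $ and [[ on environments, new in 1.9.0
e <- new.env()
e$x <- 1                 # like assign("x", 1, envir = e)
e[["y"]] <- 2
e$x + e[["y"]]           # 3
get("x", envir = e)      # 1: the underlying get/assign semantics
```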

• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular "Error()" models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
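For illustration (a sketch, not from the announcement):

```r
## head() and tail() return the first/last parts of an object
head(letters, 3)    # "a" "b" "c"
tail(1:10, 2)       # 9 10
```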

bull legend() has a new argument rsquotextcolrsquo

bull methods(class=) now checks for a matchinggeneric and so no longer returns methods fornon-visible generics (and eliminates variousmismatches)

bull A new function mget() will retrieve multiplevalues from an environment

• model.frame() methods, for example those for 'lm' and 'glm', pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as 'lqs' and 'ppr'.

• nls() and ppr() have a 'model' argument to allow the model frame to be returned as part of the fitted object.

• POSIXct objects can now have a 'tzone' attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both 'x' and 'covmat' are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.
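The relationship to the existing functions can be checked directly:

```r
# psigamma(x, deriv = n): deriv = 0 reproduces digamma(),
# deriv = 1 reproduces trigamma()
all.equal(psigamma(2.5, deriv = 0), digamma(2.5))   # TRUE
all.equal(psigamma(2.5, deriv = 1), trigamma(2.5))  # TRUE
digamma(-0.5)  # negative (non-integer) arguments now work as well
```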

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep(<0-length vector>, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both 'each' and 'length.out' have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both 'times' and 'length.out' have been specified, 'times' is now ignored, for S compatibility. (Previously padding with NAs was used.)

The POSIXct and POSIXlt methods for rep() now pass ... on to the default method (as expected by PR#5818).
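The S-compatible rep() behaviour described above, illustrated:

```r
# rep() with length.out, for S compatibility
rep(integer(0), length.out = 3)     # NA NA NA  (length 3, not 0)
rep(1:2, each = 2, length.out = 5)  # 1 1 2 2 1 (recycled, not NA-padded)
rep(1:2, times = 3, length.out = 4) # 1 2 1 2   ('times' is ignored)
```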

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option 'save.defaults', which is also used by save.image() if option 'save.image.defaults' is not present.

• New function shQuote() to quote strings to be passed to OS shells.
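For instance, with the default (Bourne-shell) quoting style:

```r
# shQuote() wraps a string in single quotes and escapes embedded ones
shQuote("abc")   # "'abc'"
shQuote("don't") # "'don'\\''t'"  (safe to pass to a POSIX shell)
```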

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments, and split="" is optimized.
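With fixed = TRUE the split string is taken literally rather than as a regular expression:

```r
# '.' as a regexp matches any character; fixed = TRUE disables that
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"
strsplit("a.b.c", "\\.")[[1]]              # same result, via an escaped regexp
```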

• subset() now allows a 'drop' argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class 'loadings' to their loadings component.


• Model fits now add a 'dataClasses' attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes 'lm', 'mlm', 'glm' and 'ppr', as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option 'X11fonts'.

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
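So an environment's bindings can now be captured as an ordinary list:

```r
# as.list() on an environment returns its variables as a named list
e <- new.env()
assign("x", 1, envir = e)
assign("y", "two", envir = e)
l <- as.list(e)
l$x  # 1
```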

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
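A minimal example with the default summary (sums); the dimension names here are made up for illustration:

```r
# addmargins() appends a "Sum" row and column to a table
tab <- as.table(matrix(1:4, nrow = 2,
                       dimnames = list(g = c("a", "b"), h = c("x", "y"))))
addmargins(tab)  # grand total appears in the ["Sum", "Sum"] cell
```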

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added 'just' argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a 'name' slot. The grob name is used to uniquely identify a drawn grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the 'grobwidth' and 'grobheight' units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a 'childrenvp' slot which is a viewport which is pushed and then 'up'ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce 'strwidth', 'strheight', 'grobwidth', and 'grobheight' unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport 'clip' arg are now:

    'on': clip to the viewport (was TRUE)
    'inherit': clip to whatever parent says (was FALSE)
    'off': turn off clipping

Still accept logical values (and NA maps to 'off').

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

• Editorial
• The Decision To Use R
  • Preface
  • Some Background on the Company
  • Evaluating The Tools of the Trade - A Value Based Equation
  • An Introduction To R
  • A Continued Evolution
  • Giving Back
  • Thanks
• The ade4 package - I: One-table methods
  • Introduction
  • The duality diagram class
  • Distance matrices
  • Taking into account groups of individuals
  • Permutation tests
  • Conclusion
• qcc: An R package for quality control charting and statistical process control
  • Introduction
  • Creating a qcc object
  • Shewhart control charts
  • Operating characteristic function
  • Cusum charts
  • EWMA charts
  • Process capability analysis
  • Pareto charts and cause-and-effect diagrams
  • Tuning and extending the package
  • Summary
  • References
• Least Squares Calculations in R
• Tools for interactively exploring R packages
  • References
• The survival package
  • Overview
  • Specifying survival data
  • One and two sample summaries
  • Proportional hazards models
  • Additional features
• useR! 2004
• R Help Desk
  • Introduction
  • Other Applications
  • Time Series
  • Input
  • Avoiding Errors
  • Comparison Table
• Programmers' Niche
  • ROC curves
  • Classes and methods
  • Generalising the class
  • S4 classes and methods
  • Discussion
• Changes in R
  • New features in 1.9.1
  • User-visible changes in 1.9.0
  • New features in 1.9.0
• Changes on CRAN
  • New contributed packages
  • Other changes
• R Foundation News
  • Donations
  • New supporting institutions
  • New supporting members

Vol. 4/1, June 2004

type       Chart       Description

Control charts for variables:
xbar       xbar chart  Sample means are plotted to control the mean
                       level of a continuous process variable.
xbar.one   X chart     Sample values from a one-at-a-time data
                       process to control the mean level of a
                       continuous process variable.
R          R chart     Sample ranges are plotted to control the
                       variability of a continuous process variable.
S          S chart     Sample standard deviations are plotted to
                       control the variability of a continuous
                       process variable.

Control charts for attributes:
p          p chart     The proportion of nonconforming units is
                       plotted. Control limits are based on the
                       binomial distribution.
np         np chart    The number of nonconforming units is plotted.
                       Control limits are based on the binomial
                       distribution.
c          c chart     The number of defectives per unit are plotted.
                       This chart assumes that defects of the quality
                       attribute are rare, and the control limits are
                       computed based on the Poisson distribution.
u          u chart     The average number of defectives per unit is
                       plotted. The Poisson distribution is used to
                       compute control limits, but unlike the c chart
                       this chart does not require a constant number
                       of units.

Table 1: Shewhart control charts available in the qcc package.


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an xbar chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or by specifying the confidence level (so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
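The examples below all act on a 'qcc' object named obj, built earlier in the article from piston-rings diameter data. As a reminder, here is a minimal sketch of how such an object might be created; the pistonrings data set and the qcc.groups() grouping call are recalled from the qcc package and are assumptions, not shown in this excerpt:

```r
# Sketch: build the grouped data and the 'qcc' object used below.
# Assumes the pistonrings data set shipped with the qcc package.
library(qcc)
data(pistonrings)
diameter <- qcc.groups(pistonrings$diameter, pistonrings$sample)
obj <- qcc(diameter[1:25,], type="xbar")
```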

Figure 1: An xbar chart for the samples data diameter[1:25,]. (Plot annotations: Number of groups = 25; Center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; Number beyond limits = 0; Number violating runs = 0.)

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the xbar chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

Figure 2: An xbar chart with calibration data (diameter[1:25,]) and new data (diameter[26:40,]). (Plot annotations: Number of groups = 40; Center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; Number beyond limits = 3; Number violating runs = 1.)

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported in Table 1, an X chart for one-at-a-time data may be obtained specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)


> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.

Figure 3: Operating characteristic curves. (Left panel: OC curves for the xbar chart, probability of type II error against process shift in std.dev units, for sample sizes n = 1, 5, 10, 15, 20. Right panel: OC curves for the p chart, probability of type II error against p.)

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful to detect small and permanent variation in the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control condition; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
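Both defaults could be changed at once; the following one-line sketch uses the argument names described above with arbitrary illustrative values:

```r
# Sketch: cusum chart with tighter decision boundaries and a larger
# shift to detect (values chosen for illustration only).
cusum(obj, decision.int = 4, se.shift = 2)
```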

Figure 4: Cusum chart for diameter[1:25,] and diameter[26:40,]. (Plot annotations: Number of groups = 40; Target = 74.00118; StdDev = 0.009887547; Decision boundaries (std. err.) = 5; Shift detection (std. err.) = 1; No. of points beyond boundaries = 4.)

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. Like the cusum chart, this plot can be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
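A larger smoothing parameter weights recent observations more heavily; as a sketch:

```r
# Sketch: EWMA chart with heavier weighting of recent observations.
ewma(obj, lambda = 0.4)
```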


Figure 5: EWMA chart for diameter[1:25,] and diameter[26:40,]. (Plot annotations: Number of groups = 40; Target = 74.00118; StdDev = 0.009887547; Smoothing parameter = 0.2; Control limits at 3*sigma.)

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' of type "xbar" or "xbar.one", and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+     spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
Center = 74.00118            LSL = 73.95
StdDev = 0.009887547         USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
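For example, wider confidence intervals for the indices could be requested like this (a sketch, reusing the specification limits from the call above):

```r
# Sketch: 99% confidence intervals for the capability indices.
process.capability(obj, spec.limits = c(73.95, 74.05),
                   confidence.level = 0.99)
```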

Figure 6: Process capability analysis for diameter[1:25,]. (Plot annotations: Number of obs = 125; Center = 74.00118; StdDev = 0.009887547; Target = 74; LSL = 73.95; USL = 74.05; Cp = 1.69; Cp_l = 1.73; Cp_u = 1.65; Cp_k = 1.65; Cpm = 1.67; Exp<LSL 0%; Exp>USL 0%; Obs<LSL 0%; Obs>USL 0%.)

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example, we may input the following code:

> process.capability.sixpack(obj,
+     spec.limits=c(73.95, 74.05),
+     target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).

Figure 7: Pareto chart for defect. (Bars in decreasing order: contact num., price code, supplier code, part num., schedule date; left axis: Frequency; right axis: Cumulative Percentage.)

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes", "Inspectors"),
+              Materials=c("Alloys", "Suppliers"),
+              Personnel=c("Supervisors", "Operators"),
+              Environment=c("Condensation", "Moisture"),
+              Methods=c("Brake", "Engager", "Angle"),
+              Machines=c("Speed", "Bits", "Sockets")),
+   effect="Surface Flaws")

Figure 8: Cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user's point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before a point is signalled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
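A sketch of both uses follows; the option name bg.margin is an assumption about the package's option list, not something stated in the text above:

```r
# List all available options and their current values.
qcc.options()
# Set a single option; 'bg.margin' is an assumed option name for
# the margin background color.
qcc.options(bg.margin = "white")
```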

Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits, and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel). (Left panel annotations: Number of groups = 15; Center = 0.1931429; StdDev = 0.3947641; LCL and UCL are variable; Number beyond limits = 0; Number violating runs = 0. Right panel: Center = 0; StdDev = 1; LCL = -3; UCL = 3; Number beyond limits = 0; Number violating runs = 0.)

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
[email protected]

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    \hat{\beta} = \arg\min_{\beta} \| y - X\beta \|^2        (1)

where X is an n x p model matrix (p <= n), y is n-dimensional and \beta is p-dimensional. Most statistics texts state that the solution to (1) is

    \hat{\beta} = (X'X)^{-1} X'y        (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X'X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X'y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X'X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
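The three rules can be illustrated on a small simulated least squares problem; the following sketch uses only base R and checks that the three approaches agree:

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)    # a 100 x 5 model matrix
y <- rnorm(100)

b1 <- solve(t(X) %*% X) %*% t(X) %*% y       # textbook formula (2)
b2 <- solve(crossprod(X), crossprod(X, y))   # rules 1 and 2
R <- chol(crossprod(X))                      # rule 3: Cholesky, X'X = R'R
b3 <- backsolve(R, forwardsolve(R, crossprod(X, y),
                                upper = TRUE, trans = TRUE))

stopifnot(all.equal(b1, b2), all.equal(b2, b3))
```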

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X'X the naive way

> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:

> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object, so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A'B are generally faster than those in AB.

Notice that this means that calculating A'B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
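A quick way to see this on one's own machine is to time the two forms directly; a sketch in base R (absolute times depend on the BLAS in use, so no expected output is shown):

```r
A <- matrix(rnorm(1000 * 500), 1000, 500)
B <- matrix(rnorm(1000 * 500), 1000, 500)
system.time(t(A) %*% B)        # transposes A explicitly, then multiplies
system.time(crossprod(A, B))   # computes A'B without transposing A
```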

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
[email protected]

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R packages (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.

R News ISSN 1609-3631


Figure 1: A snapshot of the widget when pExplorer is called.


- Double clicking an item in the Contents list box results in different responses depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

- Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

- Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files containing example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

- Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

- Clicking the View PDF button shows the PDF version of the vignette currently being explored.

- Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

- Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt,
               data=veteran),
       xlab="Years since randomisation",
       xscale=365.25, ylab="% surviving",
       yscale=100, col=c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data=veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (x axis: years since randomisation; y axis: % surviving; legend: Standard, New).

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at that time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                       log(bili) + log(protime) +
                       age + platelet,
                     data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age +
        platelet, data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data=pbc, subset=trt==-9))
> lines(survexp(~ edtrt +
          ratetable(edtrt=edtrt, bili=bili,
                    platelet=platelet, age=age,
                    protime=protime),
          data=pbc, subset=trt==-9,
          ratetable=mayomodel, cohort=TRUE),
        col="purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival (x axis: days; y axis: survival probability).

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards (x axis: time; y axis: Beta(t) for edtrt).

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
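As a concrete sketch of these terms, here is a fit to the bladder cancer data that ships with the survival package; the choice of dataset and covariates is illustrative, not from the article.

```r
library(survival)
# Recurrent bladder tumours: cluster(id) groups the multiple records
# belonging to each patient (giving robust standard errors), and
# strata(enum) gives each event number its own baseline hazard.
fit <- coxph(Surv(stop, event) ~ rx + cluster(id) + strata(enum),
             data = bladder)
fit
```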

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
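A minimal sketch of such a parametric fit, using a Weibull model for the veteran data seen earlier (the distribution and covariate chosen here are illustrative):

```r
library(survival)
# survreg models (log) survival time directly, in contrast to
# coxph, which models the hazard.
wfit <- survreg(Surv(time, status) ~ trt, data = veteran,
                dist = "weibull")
summary(wfit)$table
```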

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

- Paul Murrell on grid graphics and programming

- Martin Mächler on good programming practice


- Luke Tierney on namespaces and byte compilation

- Kurt Hornik on packaging, documentation, and testing

- Friedrich Leisch on S4 classes and methods

- Peter Dalgaard on language interfaces, using .Call and .External

- Douglas Bates on multilevel models in R

- Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip in the afternoon of May 22 to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R 1.9.0 and its contributed packages have a number of date/time (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

- Date. This is the newest R date class, just introduced into R 1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

- chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

- POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
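The internal representations described above can be checked directly with unclass(); a small sketch using only base R (chron is omitted here since it is a contributed package):

```r
# Date counts days since 1970-01-01
d <- as.Date("1970-01-02")
stopifnot(unclass(d) == 1)

# POSIXct counts seconds since 1970-01-01 00:00:00 GMT
p <- as.POSIXct("1970-01-01 00:01:00", tz = "GMT")
stopifnot(unclass(p) == 60)
```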

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
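For example, a quick check of the Windows Excel recipe above (the serial number used here is illustrative):

```r
# Excel (Windows) serial dates count from 1899-12-30;
# 36526.5 is noon on 2000-01-01 in that scheme.
x <- 36526.5
as.Date("1899-12-30") + floor(x)   # "2000-01-01"
```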

SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and date/times are in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent date/times. zoo can use any date or time/date package to represent date/times.
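A small sketch of the regular case (the values are made up):

```r
# ts labels observations by start time and frequency rather than by
# dates: here, monthly data beginning January 2000
z <- ts(1:24, start = c(2000, 1), frequency = 12)
window(z, start = c(2000, 6), end = c(2000, 8))  # Jun, Jul, Aug 2000
```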

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

- time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

  dp <- seq(Sys.time(), len=10, by="day")
  plot(dp, 1:10)

  This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

- OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

- POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

- conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
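A sketch of the "convert to character first" recipe from the last two points (the particular time is illustrative):

```r
# Going from POSIXct to Date via character keeps control of the
# time zone: format() renders the time in the chosen zone, and
# as.Date() then reads the plain "yyyy-mm-dd" text.
p <- as.POSIXct("2004-06-01 23:30:00", tz = "GMT")
as.Date(format(p, tz = "GMT"))   # "2004-06-01"
```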

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT days)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit="day") (1)

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] (2)

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] (3)

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] (4)

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) (5)

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0 + seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(format(dp))
  to chron:   from Date: chron(unclass(dt)); from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) (6)
  to POSIXct: from Date: as.POSIXct(format(dt)); from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(dp)
  to chron:   from Date: chron(unclass(dt)); from POSIXct: as.chron(dp)
  to POSIXct: from Date: as.POSIXct(format(dt), tz="GMT"); from chron: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

(1) Can be reduced to dp1 - dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
(2) Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
(3) Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
(4) Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
(5) Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/times suffices.
(6) An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
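On toy data (made up for illustration) the claim is easy to check:

```r
# sens and mspec are cumulative sums of counts divided by a total,
# so both are non-decreasing whatever the data
T <- c(0.3, 0.7, 0.2, 0.9, 0.5)
D <- c(0, 1, 0, 1, 1)
DD <- table(-T, D)
sens <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
stopifnot(!is.unsorted(sens), !is.unsorted(mspec))
```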

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a printmethod The print function is generic when givenan object of class ROC it automatically looks for aprintROC function to take care of the work Failingthat it calls printdefault which spews the inter-nal representation of the object all over the screen

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
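The "duck" point can be made concrete in a few lines (a toy sketch, not from the article; the generic quack() and the class "duck" are invented here for illustration):

```r
## A generic and a method for the invented class "duck"
quack <- function(x) UseMethod("quack")
quack.duck <- function(x) cat("quack!\n")

d <- list()
class(d) <- "duck"   # any object becomes a "duck" by assertion
quack(d)             # dispatches to quack.duck and prints "quack!"
## quack(1:10) would fail: there is no quack method for integers
```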

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

R News ISSN 1609-3631

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
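One possible solution to the exercise, using the trapezoidal rule on the stored coordinates (my sketch, not the article's):

```r
## Trapezoidal-rule AUC for the S3 "ROC" objects above:
## mspec and sens are the x and y coordinates of the curve.
AUC <- function(x) {
  sens <- x$sens
  mspec <- x$mspec
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}

## A perfect classifier traces (0,0) -> (0,1) -> (1,1): area 1.
perfect <- list(sens = c(0, 1, 1), mspec = c(0, 0, 1))
AUC(perfect)  # 1
```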

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
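With the generic in place, a plain numeric test variable dispatches to ROC.default (the seed and data below are invented for illustration; the definitions are repeated so the chunk runs alone):

```r
## Generic and default method, repeated from the text
ROC <- function(T, ...) UseMethod("ROC")
ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

set.seed(7)
D <- rbinom(50, 1, 0.5)
T <- rnorm(50, mean = D)
r <- ROC(T, D)   # dispatches to ROC.default
class(r)         # "ROC"
```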

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
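Because new() runs the validity function, malformed objects are rejected at creation time. A quick check (my example, not the article's; the class definition is repeated so the chunk is self-contained):

```r
library(methods)  # loaded by default, stated here for clarity

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

## Consistent lengths: new() succeeds
ok <- new("ROC", sens = c(0, 1), mspec = c(0, 1),
          test = c(2, 1), call = quote(ROC()))

## Inconsistent lengths: the validity check makes new() fail
bad <- try(new("ROC", sens = c(0, 1), mspec = 0,
               test = c(2, 1), call = quote(ROC())),
           silent = TRUE)
inherits(bad, "try-error")  # TRUE
```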

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[ and [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments, for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos() and atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument, package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame, in the same way as subset and weights.

• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new: an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.

• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  – Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))

  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their "vp" slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit() and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

      "on" = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off" = turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.

• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spa-tial clusters of disease using count data Boot-strap is used to estimate sampling distributionsof statistics By Virgilio Goacutemez-Rubio JuanFerraacutendiz Antonio Loacutepez

GeneTS A package for analyzing multiple gene ex-pression time series data Currently GeneTSimplements methods for cell cycle analy-sis and for inferring large sparse graphi-cal Gaussian models For plotting the in-ferred genetic networks GeneTS requires thegraph and Rgraphviz packages (availablefrom wwwbioconductororg) By Konstanti-nos Fokianos Juliane Schaefer and KorbinianStrimmer

HighProbability Provides a simple fast reliable so-lution to the multiple testing problem Given avector of p-values or achieved significance lev-els computed using standard frequentist infer-ence HighProbability determines which onesare low enough that their alternative hypothe-ses can be considered highly probable The p-value vector may be determined using exist-ing R functions such as ttest wilcoxtestcortest or sample HighProbability can beused to detect differential gene expression andto solve other problems involving a large num-ber of hypothesis tests By David R Bickel

Icens Many functions for computing the NPMLE forcensored and truncated data By R Gentlemanand Alain Vandal

MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 04 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn

MNP A publicly available R package that fitsthe Bayesian Multinomial Probit models viaMarkov chain Monte Carlo Along with thestandard Multinomial Probit model it can alsofit models with different choice sets for each ob-servation and complete or partial ordering ofall the available alternatives The estimation isbased on the efficient marginal data augmen-tation algorithm that is developed by Imai andvan Dyk (2004) By Kosuke Imai Jordan VanceDavid A van Dyk

NADA Contains methods described by Dennis RHelsel in his book ldquoNondetects And Data Anal-ysis Statistics for Censored EnvironmentalDatardquo By Lopaka Lee

R2WinBUGS Using this package it is possible tocall a BUGS model summarize inferences andconvergence in a table and graph and savethe simulations in arrays for easy access inR By originally written by Andrew Gelmanchanges and packaged by Sibylle Sturtz andUwe Ligges

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer

RUnit R functions implementing a standard unit testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas

circular Circular statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer

R News ISSN 1609-3631

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil; R port by Alec Stephenson

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 groups) classification. By Beiying Ding

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley; some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869 and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so called "american practice") or specifying the confidence level (so called "british practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.
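This equivalence is easy to verify directly in base R:

```r
# two-tailed probability of a point falling beyond +/- 3 standard
# errors under a standard normal curve
2 * pnorm(-3)        # 0.002699796

# conversely, the sigma multiplier implied by that confidence level
qnorm(1 - 0.0027/2)  # approximately 3
```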

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
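For instance, if the process parameters are known in advance rather than estimated from the calibration samples, they may be supplied directly. A sketch, using the pistonrings data from earlier in this article; the numeric values for center and std.dev are illustrative only:

```r
library(qcc)
data(pistonrings)
diameter <- qcc.groups(pistonrings$diameter, pistonrings$sample)

# use a known process center and standard deviation instead of the
# values estimated from the 25 calibration samples
q <- qcc(diameter[1:25,], type="xbar", center=74, std.dev=0.01)
```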

[Figure: xbar chart for diameter[1:25,], group summary statistics with center line, LCL and UCL. Number of groups = 25, Center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, Number beyond limits = 0, Number violating runs = 0.]

Figure 1: An X chart for the samples data.

Shewhart control charts

A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can however be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).

Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.

[Figure: xbar chart for diameter[1:25,] and diameter[26:40,]. Calibration data in diameter[1:25,], new data in diameter[26:40,]. Number of groups = 40, Center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, Number beyond limits = 3, Number violating runs = 1.]

Figure 2: An X chart with calibration data and new data samples.

If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).

As reported on Table 1, an X chart for one-at-time data may be obtained specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
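A minimal sketch, charting a vector of purely illustrative individual measurements:

```r
library(qcc)

# one-at-a-time data: each value is its own "group"; the standard
# deviation is estimated from moving ranges of 2 consecutive points
x <- c(33.75, 33.05, 34.00, 33.81, 33.46, 34.02, 33.68, 33.27)
q1 <- qcc(x, type="xbar.one")
```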

Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:

> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
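Since the probabilities are returned invisibly, they can be captured for further use, for instance to compute the power of detecting a given shift. A sketch; see help(oc.curves) for the exact layout of the returned object:

```r
library(qcc)
data(pistonrings)
diameter <- qcc.groups(pistonrings$diameter, pistonrings$sample)
obj <- qcc(diameter[1:25,], type="xbar", plot=FALSE)

beta <- oc.curves(obj)  # type II error probabilities
power <- 1 - beta       # probability of detecting each shift
```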

[Figure: two panels. Left: OC curves for xbar chart, probability of type II error versus process shift (std.dev), for sample sizes n = 1, 5, 10, 15, 20. Right: OC curves for p chart, probability of type II error versus p.]

Figure 3: Operating characteristics curves.

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
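Putting the arguments above together, a cusum chart with a tighter decision interval might be requested as follows (a sketch, assuming decision.int and se.shift are accepted by cusum() as described; check help(cusum) for the installed version):

```r
library(qcc)
data(pistonrings)
diameter <- qcc.groups(pistonrings$diameter, pistonrings$sample)
obj <- qcc(diameter[1:25,], type="xbar", plot=FALSE)

# decision boundaries at 4 standard errors, tuned to detect a
# shift of 1 standard error in the process mean
cusum(obj, decision.int=4, se.shift=1)
```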

[Figure: cusum chart for diameter[1:25,] and diameter[26:40,], cumulative sums above and below target with lower/upper decision boundaries (LDB, UDB). Calibration data in diameter[1:25,], new data in diameter[26:40,]. Number of groups = 40, Target = 74.00118, StdDev = 0.009887547, decision boundaries (std. err.) = 5, shift detection (std. err.) = 1, No. of points beyond boundaries = 4.]

Figure 4: Cusum chart.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, also this plot can be used to detect small and permanent variation on the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.


[Figure: EWMA chart for diameter[1:25,] and diameter[26:40,], group summary statistics with control limits. Calibration data in diameter[1:25,], new data in diameter[26:40,]. Number of groups = 40, Target = 74.00118, StdDev = 0.009887547, smoothing parameter = 0.2, control limits at 3*sigma.]

Figure 5: EWMA chart.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for a "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
                   spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
Center = 74.00118            LSL = 73.95
StdDev = 0.009887547         USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is an histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).

[Figure: process capability analysis for diameter[1:25,]: histogram of the data with LSL, Target and USL marked. Number of obs = 125, Center = 74.00118, StdDev = 0.009887547, Target = 74, LSL = 73.95, USL = 74.05; Cp = 1.69, Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67; Exp<LSL 0%, Exp>USL 0%, Obs<LSL 0%, Obs>USL 0%.]

Figure 6: Process capability analysis.

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• a X chart

• a R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q–Q plot

• a capability plot

As an example we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).

[Figure: Pareto chart for defect; bars in decreasing order of frequency (contact num., price code, supplier code, part num., schedule date), with a cumulative percentage line.]

Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

[Figure: cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle), Machines (Speed, Bits, Sockets).]

Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signaling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
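As a sketch (the option names used here follow help(qcc.options) and should be checked against the installed version of the package):

```r
library(qcc)

qcc.options()              # list all options and current values
qcc.options(run.length=7)  # signal a run after 7 points on one side

# highlight points beyond the control limits with a filled blue dot
qcc.options(beyond.limits=list(pch=16, col="blue"))
```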

R News ISSN 1609-3631

Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
  {
    lcl <- -conf
    ucl <- +conf
  }
  else
  {
    if (conf > 0 & conf < 1)
    {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid 'conf' argument.")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
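As a quick cross-check of the standardization used above, the z score for each sample is z = (x/n - pbar)/sqrt(pbar(1 - pbar)/n). The following pure-Python sketch mirrors that computation on a few made-up samples (the data and the function name p_std_stats are hypothetical, for illustration only):

```python
import math

def p_std_stats(counts, sizes):
    """Standardized p-chart statistics: z scores with center 0.

    A sketch of the same computation as the stats.p.std R
    function above, on plain Python lists.
    """
    pbar = sum(counts) / sum(sizes)           # overall proportion
    z = [(x / n - pbar) / math.sqrt(pbar * (1 - pbar) / n)
         for x, n in zip(counts, sizes)]      # one z score per sample
    return z, 0.0                             # statistics and center

# Hypothetical samples with unequal sizes
z, center = p_std_stats([12, 8, 30], [50, 50, 100])
print([round(v, 3) for v in z], center)  # -> [-0.163, -1.47, 1.155] 0.0
```

Samples whose observed proportion sits far from the pooled proportion, relative to their own sampling standard deviation, get large z scores regardless of the sample size, which is what makes the standardized chart readable with fixed limits.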

Then we may source() the above code and obtain the control charts in Figure 9 as follows:

> # set unequal sample sizes
> n <- c(rep(50, 5), rep(100, 5), rep(25, 5))
> # generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow = c(1, 2))
> # plot the control chart with variable limits
> qcc(x, type = "p", size = n)
> # plot the standardized control chart
> qcc(x, type = "p.std", size = n)

[Figure: left panel "p Chart for x", groups 1-15, variable LCL and UCL, Center = 0.1931429, StdDev = 0.3947641; right panel "p.std Chart for x", groups 1-15, LCL = -3, UCL = 3, Center = 0, StdDev = 1; both panels: number beyond limits = 0, number violating runs = 0]

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel)

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.


Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R
Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

$$\hat\beta = \arg\min_{\beta} \|y - X\beta\|^2 \qquad (1)$$

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

$$\hat\beta = (X'X)^{-1} X'y \qquad (2)$$

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed, (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X'X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X'y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X'X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
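Formula (2) can be worked by hand for a tiny regression, which makes the principles above concrete. This pure-Python sketch (made-up data, not from the article) forms X'X and X'y and inverts the 2 x 2 cross-product explicitly, which is exactly the naive approach the principles advise against for real problems:

```python
# Least squares by the textbook formula (2): beta = (X'X)^{-1} X'y.
# X has an intercept column and one predictor; y is the response.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0, 4.0]

# Cross-products X'X (2x2) and X'y (length 2)
xtx = [[sum(X[i][a] * X[i][b] for i in range(3)) for b in range(2)]
       for a in range(2)]
xty = [sum(X[i][a] * y[i] for i in range(3)) for a in range(2)]

# Explicit 2x2 inverse applied to X'y
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
beta = [(xtx[1][1] * xty[0] - xtx[0][1] * xty[1]) / det,
        (xtx[0][0] * xty[1] - xtx[1][0] * xty[0]) / det]
print(beta)  # -> [0.8333333333333334, 1.5]
```

The fitted line y = 5/6 + 1.5x leaves residuals orthogonal to both columns of X, which is the defining property of the least squares solution; on problems of realistic size the same answer should be obtained without ever forming the inverse.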

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X'X the naive way

> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:

> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
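The pair of triangular solves can look opaque on first reading. This pure-Python sketch (a tiny hypothetical 2 x 2 system, illustrative only; real code should call a tuned linear algebra library) spells out the same three steps: factor A = LL', forward solve L w = b, then back solve L'x = w:

```python
import math

def chol(A):
    """Lower-triangular Cholesky factor L with A = L L'."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_chol(A, b):
    """Solve A x = b for symmetric positive definite A via Cholesky."""
    L = chol(A)
    n = len(b)
    w = [0.0] * n              # forward solve: L w = b
    for i in range(n):
        w[i] = (b[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    x = [0.0] * n              # back solve: L' x = w
    for i in reversed(range(n)):
        x[i] = (w[i] - sum(L[k][i] * x[k]
                           for k in range(i + 1, n))) / L[i][i]
    return x

# A small symmetric positive definite system, checkable by hand
print(solve_chol([[3.0, 3.0], [3.0, 5.0]], [7.0, 10.0]))  # close to [5/6, 1.5]
```

Because the factor is triangular, each solve is a single sweep rather than a general elimination, which is where the speed advantage over the LU route comes from.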


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.
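Compressed sparse column storage keeps only the nonzero values, their row indices, and one pointer per column. The following pure-Python sketch of a hypothetical 3 x 3 matrix illustrates the general CSC scheme (not the Matrix package's actual internals):

```python
# Dense matrix being represented:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 5, 0]]
values  = [1, 4, 5, 2, 3]   # nonzeros, stored column by column
row_ind = [0, 2, 2, 0, 1]   # row index of each nonzero
col_ptr = [0, 2, 3, 5]      # values[col_ptr[j]:col_ptr[j+1]] is column j

def csc_get(i, j):
    """Look up element (i, j) by scanning column j's nonzeros."""
    for k in range(col_ptr[j], col_ptr[j + 1]):
        if row_ind[k] == i:
            return values[k]
    return 0

print([[csc_get(i, j) for j in range(3)] for i in range(3)])
# -> [[1, 0, 2], [0, 0, 3], [4, 5, 0]]
```

Column-oriented storage makes operations that sweep down columns, such as forming crossproducts, cheap for such matrices, and the symmetric variant need only store one triangle.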

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS) such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A'B are generally faster than those in AB.

Notice that this means that calculating A'B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice, once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked


The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)
> ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+     diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
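The estimate behind such survival curves is the Kaplan-Meier product-limit estimator: at each observed event time the current survival estimate is multiplied by (1 - d/n), where d is the number of events at that time and n the number still at risk. A minimal pure-Python sketch on made-up right-censored data (illustrative only, ignoring the refinements in the real package):

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate for right-censored data.

    times: follow-up times; events: 1 = event observed, 0 = censored.
    Returns (time, survival) pairs at each event time.
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    surv, s, i = [], 1.0, 0
    while i < len(pairs):
        t = pairs[i][0]
        d = n = 0
        while i < len(pairs) and pairs[i][0] == t:   # group tied times
            d += pairs[i][1]                          # events at t
            n += 1                                    # subjects leaving risk set
            i += 1
        if d:
            s *= 1 - d / n_at_risk
            surv.append((t, s))
        n_at_risk -= n
    return surv

# Hypothetical follow-up times; the 25+ style censored values become event = 0
print(kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))
```

Censored observations never trigger a downward step themselves, but they shrink the risk set, which is how partial observation enters the estimate.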

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+     xlab = "Years since randomisation", xscale = 365,
+     ylab = "% surviving", yscale = 100,
+     col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


[Figure: two survival curves, x-axis "Years since randomisation" (0 to 2.5), y-axis "% surviving" (0 to 100), legend: Standard, New]

Figure 1: Survival distributions for two lung cancer treatments

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

$$\log h(t; z) = \log h_0(t) + \beta' z.$$

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+     log(bili) + log(protime) + age + platelet,
+     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
+     data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+     ratetable(edtrt = edtrt, bili = bili,
+         platelet = platelet, age = age,
+         protime = protime),
+     data = pbc, subset = trt == -9,
+     ratetable = mayomodel, cohort = TRUE),
+     col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure: plot of "Beta(t) for edtrt" against "Time"]

Figure 3: Testing proportional hazards

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada
jfox@mcmaster.ca

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills, and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

R News ISSN 1609-3631

Vol. 4/1, June 2004

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications such as world currency markets where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
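The unclass() inspection suggested above can be tried directly (my own example; recent versions of R may show two extra POSIXlt components, zone and gmtoff, beyond the 9 listed in the article):

```r
# POSIXct is seconds since 1970-01-01 GMT; POSIXlt is a component list
p <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")
unclass(p)                       # 86400 seconds past the origin
lt <- as.POSIXlt(p, tz = "GMT")
names(unclass(lt))               # sec, min, hour, mday, mon, year, ...
```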

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904 so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
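A worked instance of the Windows-origin conversion just described, with made-up serial numbers (base R only, no spreadsheet required):

```r
# hypothetical Excel (Windows origin) serial datetimes
x <- c(2, 100.5)
as.Date("1899-12-30") + floor(x)   # "1900-01-01" "1900-04-09"
```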

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
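The numeric labelling scheme that ts uses can be seen in a small sketch (my own example, base R only):

```r
# a monthly regular series starting January 2002
z <- ts(1:24, start = c(2002, 1), frequency = 12)
start(z); end(z)     # c(2002, 1) and c(2003, 12)
time(z)[2]           # 2002.083...: a plain number, not a date class
```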

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
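Since laketemp.txt is not reproduced here, a self-contained variant of the first idiom can be run against an inline text connection (my own example):

```r
# same pattern as above, with the file replaced by textConnection()
txt <- "date temp\n20040115 1.2\n20040116 1.4"
df <- read.table(textConnection(txt), header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")  # "2004-01-15" "2004-01-16"
```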

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.



orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz = "" and tz = "GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
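The convert-via-character advice can be sketched with base R classes alone (my own example; the fixed tz = "GMT" here is just to make the result reproducible):

```r
# going from POSIXct to Date via character avoids time zone surprises
p <- as.POSIXct("2004-06-01 12:30:00", tz = "GMT")
as.Date(format(p, tz = "GMT"))   # Date "2004-06-01"
```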

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de

R News ISSN 1609-3631

Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]
  time zone difference: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date from chron:     as.Date(dates(dc))
  to Date from POSIXct:   as.Date(format(dp))
  to chron from Date:     chron(unclass(dt))
  to chron from POSIXct:  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct from Date:   as.POSIXct(format(dt))
  to POSIXct from chron:  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date from chron:     as.Date(dates(dc))
  to Date from POSIXct:   as.Date(dp)
  to chron from Date:     chron(unclass(dt))
  to chron from POSIXct:  as.chron(dp)
  to POSIXct from Date:   as.POSIXct(format(dt), tz = "GMT")
  to POSIXct from chron:  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for the % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).



Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T == TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
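The cumulative-sum construction can be checked on toy data (my own example, with perfect separation so the expected rates are easy to see):

```r
# T = test results, D = disease indicator
T <- c(1, 2, 3, 4); D <- c(0, 0, 1, 1)
DD <- table(-T, D)                        # rows ordered from highest T down
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])   # 0.5 1.0 1.0 1.0
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])   # 0.0 0.0 0.5 1.0
```

Both vectors are non-decreasing, as the monotonicity argument above requires.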

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
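One possible sketch of such a function (my own, not from the article) applies the trapezoidal rule to the sens and mspec components produced by ROC():

```r
# area under the ROC curve by the trapezoidal rule; the curve is
# extended to start at (0, 0), and ends at (1, 1) by construction
AUC <- function(x) {
  fpr <- c(0, x$mspec)
  tpr <- c(0, x$sens)
  sum(diff(fpr) * (tpr[-1] + tpr[-length(tpr)]) / 2)
}
```

For a perfectly separating test the sketch returns 1, and for a single uninformative cutpoint it returns 0.5.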

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
         representation(sens = "numeric", mspec = "numeric",
                        test = "numeric", call = "call"),
         validity = function(object) {
           length(object@sens) == length(object@mspec) &&
             length(object@sens) == length(object@test)
         })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
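The validity check can be exercised directly (the class definition is repeated so this snippet is self-contained; it needs the methods package):

```r
library(methods)
setClass("ROC",
         representation(sens = "numeric", mspec = "numeric",
                        test = "numeric", call = "call"),
         validity = function(object) {
           length(object@sens) == length(object@mspec) &&
             length(object@sens) == length(object@test)
         })
# matching lengths pass the check ...
ok <- new("ROC", sens = 0.5, mspec = 0.1, test = 2, call = quote(ROC()))
# ... mismatched lengths make new() signal an error
bad <- try(new("ROC", sens = c(0.5, 1), mspec = 0.1, test = 2,
               call = quote(ROC())), silent = TRUE)
inherits(bad, "try-error")   # TRUE
```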

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
          function(object) {
            cat("ROC curve: ")
            print(object@call)
          })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
          function(x, y, type = "b", null.line = TRUE,
                   xlab = "1-Specificity",
                   ylab = "Sensitivity",
                   main = NULL, ...) {
            par(pty = "s")
            plot(x@mspec, x@sens, type = type,
                 xlab = xlab, ylab = ylab, ...)
            if (null.line)
              abline(0, 1, lty = 3)
            if (is.null(main))
              main <- x@call
            title(main = main)
          })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
          function(x, ...)
            lines(x@mspec, x@sens, ...))

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
          function(T) {
            if (!(T$family$family %in%
                  c("binomial", "quasibinomial")))
              stop("ROC curves for binomial glms only")
            test <- fitted(T)
            disease <- (test + resid(T, type = "response"))
            disease <- disease * weights(T)
            if (max(abs(disease %% 1)) > 0.01)
              warning("Y values suspiciously far from integers")
            TT <- rev(sort(unique(test)))
            DD <- table(-test, disease)
            sens <- cumsum(DD[, 2]) / sum(DD[, 2])
            mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
            new("ROC", sens = sens, mspec = mspec,
                test = TT, call = sys.call())
          })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
          function(x, labels = NULL, ..., digits = 1) {
            if (is.null(labels))
              labels <- round(x@test, digits)
            identify(x@mspec, x@sens, labels = labels, ...)
          })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits = FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.



bull Functions grep() regexpr() sub() andgsub() now coerce their arguments to char-acter rather than give an error

The perl=TRUE argument now uses charactertables prepared for the locale currently in useeach time it is used rather than those of the Clocale

bull New functions head() and tail() in packagelsquoutilsrsquo (Based on a contribution by PatrickBurns)

bull legend() has a new argument rsquotextcolrsquo

bull methods(class=) now checks for a matchinggeneric and so no longer returns methods fornon-visible generics (and eliminates variousmismatches)

bull A new function mget() will retrieve multiplevalues from an environment

bull modelframe() methods for example those forlm and glm pass relevant parts of ontothe default method (This has long been docu-mented but not done) The default method isnow able to cope with model classes such aslqs and ppr

bull nls() and ppr() have a model argument to al-low the model frame to be returned as part ofthe fitted object

bull POSIXct objects can now have a tzone at-tribute that determines how they will be con-verted and printed This means that date-timeobjects which have a timezone specified willgenerally be regarded as in their original timezone

bull postscript() device output has been modi-fied to work around rounding errors in low-precision calculations in gs gt= 811 (PR5285which is not a bug in R)

It is now documented how to use other Com-puter Modern fonts for example italic ratherthan slanted

bull ppr() now fully supports categorical explana-tory variables

ppr() is now interruptible at suitable places inthe underlying FORTRAN code

bull princomp() now warns if both x and covmatare supplied and returns scores only if the cen-tring used is known

bull psigamma(x deriv=0) a new function gen-eralizes digamma() etc All these (psigammadigamma trigamma) now also work for x lt0

bull pchisq( ncp gt 0) and hence qchisq() nowwork with much higher values of ncp it has be-come much more accurate in the left tail

bull readtable() now allows embedded newlinesin quoted fields (PR4555)

bull repdefault(0-length-vector lengthout=n)now gives a vector of length n and not length0 for compatibility with S

If both each and lengthout have been speci-fied it now recycles rather than fills with NAsfor S compatibility

If both times and lengthout have been spec-ified times is now ignored for S compatibility(Previously padding with NAs was used)

The POSIXct and POSIXlt methods forrep() now pass on to the default method(as expected by PR5818)

bull rgb2hsv() is new an R interface the C APIfunction with the same name

bull User hooks can be set for onLoadlibrary detach and onUnload of pack-agesnamespaces see setHook

bull save() default arguments can now beset using option savedefaults whichis also used by saveimage() if optionsaveimagedefaults is not present

bull New function shQuote() to quote strings to bepassed to OS shells

• sink() now has a split= argument to direct output to both the sink and the current output connection.
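A small sketch of the new argument, using a temporary file as the sink target:

```r
tf <- tempfile()
sink(tf, split = TRUE)  # output goes to both 'tf' and the console
print(1:3)
sink()                  # restore normal output
readLines(tf)           # the file holds a copy of the printed output
```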

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments, and split = "" is optimized.
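For example (note that without fixed = TRUE the split string is still interpreted as a regular expression):

```r
# "." as a regexp matches every character, so splitting occurs at each
# position and only empty strings remain:
strsplit("a.b.c", ".")[[1]]
# fixed = TRUE treats the split string literally:
stopifnot(identical(strsplit("a.b.c", ".", fixed = TRUE)[[1]],
                    c("a", "b", "c")))
# perl = TRUE enables PCRE regular-expression syntax:
stopifnot(identical(strsplit("a1b22c", "[0-9]+", perl = TRUE)[[1]],
                    c("a", "b", "c")))
```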

• subset() now allows a 'drop' argument which is passed on to the indexing method for data frames.
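A minimal sketch with a toy data frame (the column names are invented for illustration):

```r
df <- data.frame(x = 1:5, y = 6:10)
# Default: the result keeps its data-frame structure.
one_col <- subset(df, x > 3, select = y)
stopifnot(is.data.frame(one_col))
# With drop = TRUE a single selected column comes back as a vector:
stopifnot(identical(subset(df, x > 3, select = y, drop = TRUE),
                    c(9L, 10L)))
```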

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.

R News ISSN 1609-3631

Vol. 4/1, June 2004

• Model fits now add a "dataClasses" attribute to the terms which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the '...' of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
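For example:

```r
e <- new.env()
e$a <- 1
e$b <- "two"
contents <- as.list(e)   # dispatches to as.list.environment
stopifnot(setequal(names(contents), c("a", "b")))
```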

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
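A small example with a two-way table (the toy data and dimension names are invented for illustration):

```r
tab <- table(group = c("a", "a", "b"), flag = c("x", "y", "y"))
with_sums <- addmargins(tab)   # default summary is sum(), labelled "Sum"
stopifnot("Sum" %in% rownames(with_sums),
          "Sum" %in% colnames(with_sums))
```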

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added 'just' argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport 'clip' argument are now:

  "on": clip to the viewport (was TRUE)
  "inherit": clip to whatever parent says (was FALSE)
  "off": turn off clipping

Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns, ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns, ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

• Editorial
• The Decision To Use R
  • Preface
  • Some Background on the Company
  • Evaluating The Tools of the Trade - A Value Based Equation
  • An Introduction To R
  • A Continued Evolution
  • Giving Back
  • Thanks
• The ade4 package - I: One-table methods
  • Introduction
  • The duality diagram class
  • Distance matrices
  • Taking into account groups of individuals
  • Permutation tests
  • Conclusion
• qcc: An R package for quality control charting and statistical process control
  • Introduction
  • Creating a qcc object
  • Shewhart control charts
  • Operating characteristic function
  • Cusum charts
  • EWMA charts
  • Process capability analysis
  • Pareto charts and cause-and-effect diagrams
  • Tuning and extending the package
  • Summary
  • References
• Least Squares Calculations in R
• Tools for interactively exploring R packages
  • References
• The survival package
  • Overview
  • Specifying survival data
  • One and two sample summaries
  • Proportional hazards models
  • Additional features
• useR! 2004
• R Help Desk
  • Introduction
  • Other Applications
  • Time Series
  • Input
  • Avoiding Errors
  • Comparison Table
• Programmers' Niche
  • ROC curves
  • Classes and methods
    • Generalising the class
  • S4 classes and methods
  • Discussion
• Changes in R
  • New features in 1.9.1
  • User-visible changes in 1.9.0
  • New features in 1.9.0
• Changes on CRAN
  • New contributed packages
  • Other changes
• R Foundation News
  • Donations
  • New supporting institutions
  • New supporting members

> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE

> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.

Operating characteristic function

An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
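The arithmetic behind such curves is simple normal-distribution algebra. As an illustration only (this is not qcc code, and the function name is ours), a short Python sketch of the type II error for an xbar chart with limits at ±3 standard errors:

```python
from statistics import NormalDist

def xbar_type2_error(shift_sd: float, n: int, nsigmas: float = 3.0) -> float:
    """Probability of NOT detecting a mean shift of `shift_sd` process
    standard deviations on an xbar chart with samples of size n and
    control limits at +/- nsigmas standard errors of the mean."""
    phi = NormalDist().cdf
    d = shift_sd * n ** 0.5          # shift expressed in standard-error units
    return phi(nsigmas - d) - phi(-nsigmas - d)

# With no shift the value is just the in-control coverage, about 0.9973;
# larger shifts or larger samples drive the type II error towards zero.
print(round(xbar_type2_error(0.0, 5), 4))
print(round(xbar_type2_error(1.0, 5), 4))
```

Plotting this quantity against the shift, for several sample sizes, reproduces the shape of the left panel of Figure 3.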

Figure 3: Operating characteristic curves. Left panel: OC curves for an xbar chart (probability of type II error against process shift in std.dev units, for n = 1, 5, 10, 15, 20). Right panel: OC curves for a p chart (probability of type II error against p).

Cusum charts

Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
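The mechanics of a tabular cusum can be sketched in a few lines. The following Python fragment is an illustration under the usual textbook formulation (reference value k = shift/2, signal when either cumulative sum exceeds the decision interval); qcc's own implementation may differ in detail:

```python
def cusum(stats, center, se, shift=1.0, decision_int=5.0):
    """Tabular cusum of standardized group statistics.
    k = shift/2 is the reference value; an out-of-control signal
    occurs when either cumulative sum exceeds `decision_int`."""
    k = shift / 2.0
    hi = lo = 0.0
    above, below, signals = [], [], []
    for i, x in enumerate(stats):
        z = (x - center) / se          # standardize the group statistic
        hi = max(0.0, hi + z - k)      # accumulates upward deviations
        lo = max(0.0, lo - z - k)      # accumulates downward deviations
        above.append(hi)
        below.append(lo)
        if hi > decision_int or lo > decision_int:
            signals.append(i)
    return above, below, signals

# A one-standard-error step change after sample 10 is flagged once the
# cumulative sum drifts past the decision interval of 5.
data = [0.0] * 10 + [1.0] * 20
above, below, signals = cusum(data, center=0.0, se=1.0)
print(signals[:1])    # first signal at index 20
```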

Figure 4: Cusum chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data). Number of groups = 40; target = 74.00118; StdDev = 0.009887547; decision boundaries (std. err.) = 5; shift detection (std. err.) = 1; number of points beyond boundaries = 4.

EWMA charts

Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. Like the cusum chart, this plot can be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5
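The smoothing recursion itself is one line, z_t = λ x_t + (1 − λ) z_{t−1}, started at the target. A minimal Python sketch of this recursion (an illustration only, not qcc's code):

```python
def ewma(x, lam=0.2, target=None):
    """Exponentially weighted moving average:
    z_t = lam * x_t + (1 - lam) * z_{t-1},
    started at the target (the series mean if no target is given)."""
    z = target if target is not None else sum(x) / len(x)
    out = []
    for xt in x:
        z = lam * xt + (1 - lam) * z
        out.append(z)
    return out

# A single spike at 12 is pulled towards the target of 10 and then
# decays geometrically, which is what makes the chart smooth.
series = [10.0, 10.0, 12.0, 10.0, 10.0]
print([round(z, 3) for z in ewma(series, lam=0.2, target=10.0)])
```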


Figure 5: EWMA chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data). Number of groups = 40; target = 74.00118; StdDev = 0.009887547; smoothing parameter = 0.2; control limits at 3*sigma.

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+     spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125       Target = 74
Center = 74.00118         LSL = 73.95
StdDev = 0.009887547      USL = 74.05

Capability indices:
       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
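The point estimates of these indices are simple functions of the summary statistics. The following Python sketch (our own helper, using the standard textbook definitions rather than qcc's code) reproduces the values printed above from the reported center and standard deviation:

```python
def capability_indices(center, std_dev, lsl, usl, target=None):
    """Basic process capability indices from summary statistics."""
    if target is None:
        target = (lsl + usl) / 2.0
    cp   = (usl - lsl) / (6 * std_dev)
    cp_l = (center - lsl) / (3 * std_dev)
    cp_u = (usl - center) / (3 * std_dev)
    cp_k = min(cp_l, cp_u)
    # Cpm penalizes departure of the process center from the target
    cpm  = cp / (1 + ((center - target) / std_dev) ** 2) ** 0.5
    return {"Cp": cp, "Cp_l": cp_l, "Cp_u": cp_u, "Cp_k": cp_k, "Cpm": cpm}

idx = capability_indices(center=74.00118, std_dev=0.009887547,
                         lsl=73.95, usl=74.05, target=74.0)
print({k: round(v, 3) for k, v in idx.items()})
```

The rounded values agree with the Value column of the output above (Cp = 1.686, Cp_k = 1.646, Cpm = 1.674).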

Figure 6: Process capability analysis for diameter[1:25,]. Number of obs = 125; center = 74.00118; StdDev = 0.009887547; target = 74; LSL = 73.95; USL = 74.05; Cp = 1.69; Cp_l = 1.73; Cp_u = 1.65; Cp_k = 1.65; Cpm = 1.67; Exp<LSL 0%; Exp>USL 0%; Obs<LSL 0%; Obs>USL 0%.

A further display, called a "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example we may input the following code

> process.capability.sixpack(obj,
+     spec.limits=c(73.95,74.05),
+     target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
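The ordering-plus-cumulative-sum computation behind such a table is straightforward. As an illustration (pareto_table is our own name, not part of qcc), in Python:

```python
def pareto_table(freq: dict):
    """Order categories by decreasing frequency and attach
    cumulative frequencies and cumulative percentages."""
    items = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(freq.values())
    cum, rows = 0, []
    for name, f in items:
        cum += f
        rows.append((name, f, cum, round(100.0 * cum / total, 1)))
    return rows

# Same counts as in the R example above.
defect = {"price code": 80, "schedule date": 27, "supplier code": 66,
          "contact num.": 94, "part num.": 33}
for row in pareto_table(defect):
    print(row)
```

The first-listed category ("contact num.", 94 occurrences) already accounts for roughly a third of all defects, which is exactly the information the chart's cumulative line conveys.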

Figure 7: Pareto chart for defect (bars in decreasing order: contact num., price code, supplier code, part num., schedule date; left axis: frequency, 0 to 300; right axis: cumulative percentage, 0 to 100).

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

Figure 8: Cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes) {
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf) {
  if (conf >= 1) {
    lcl <- -conf
    ucl <- +conf
  } else if (conf > 0 & conf < 1) {
    nsigmas <- qnorm(1 - (1 - conf)/2)
    lcl <- -nsigmas
    ucl <- +nsigmas
  } else stop("invalid conf argument")
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)
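The standardization performed by stats.p.std can be mirrored in a few lines of Python as a quick sanity check (an illustration only; the function name is ours). When every sample has the pooled proportion, all z scores are exactly at the center of zero:

```python
def p_std_stats(counts, sizes):
    """Standardized p-chart statistics: z scores of sample proportions
    around the pooled proportion, mirroring the stats.p.std example."""
    pbar = sum(counts) / sum(sizes)
    z = [(d / n - pbar) / (pbar * (1 - pbar) / n) ** 0.5
         for d, n in zip(counts, sizes)]
    return z, pbar

# Three samples of different sizes, each with proportion 0.2:
counts = [10, 20, 5]
sizes  = [50, 100, 25]
z, pbar = p_std_stats(counts, sizes)
print(round(pbar, 3), [round(v, 3) for v in z])
```

Because the statistics are standardized, the chart's center is 0 and its control limits are constant at ±nsigmas, which is the whole point of the exercise.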

Figure 9: A p chart with non-constant control limits (left panel: 15 groups, center = 0.1931429, variable LCL and UCL) and the corresponding standardized control chart (right panel: center = 0, StdDev = 1, LCL = -3, UCL = 3). In both charts, no points are beyond the limits and no runs are violated.

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ||y - Xβ||²        (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y                  (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
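The payoff of principle 3 can be seen even outside R. The following pure-Python sketch (hand-rolled for illustration; in practice one would use an optimized linear algebra library) solves the normal equations X′X b = X′y with a Cholesky factorization and two triangular solves, avoiding any explicit matrix inverse:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def cholesky(A):
    """Lower-triangular L with L L' = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = ((A[i][i] - s) ** 0.5 if i == j
                       else (A[i][j] - s) / L[j][j])
    return L

def solve_ls(X, y):
    """Least squares via the normal equations X'X b = X'y,
    solved with a Cholesky factorization and two triangular solves."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    L = cholesky(XtX)
    n = len(Xty)
    w = [0.0] * n                       # forward solve: L w = X'y
    for i in range(n):
        w[i] = (Xty[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    b = [0.0] * n                       # back solve: L' b = w
    for i in reversed(range(n)):
        b[i] = (w[i] - sum(L[k][i] * b[k]
                           for k in range(i + 1, n))) / L[i][i]
    return b

# Fit y = 1 + 2x exactly; the recovered coefficients are (1, 2).
X = [[1.0, x] for x in (0.0, 1.0, 2.0, 3.0)]
y = [1.0, 3.0, 5.0, 7.0]
print([round(bi, 6) for bi in solve_ls(X, y)])
```

The factor-then-solve structure is exactly what the chol/backsolve/forwardsolve idiom below expresses in R.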

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sysgc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sysgc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sysgc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sysgc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:

> sysgc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sysgc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times:

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice; once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
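The gating logic described above can be sketched as a tiny state machine (an illustration of the idea only, not the tkWidgets implementation; the class name is ours):

```python
class ChunkGate:
    """Sequential-evaluation gate: only the first unevaluated
    chunk is 'active' (its button enabled) at any time."""
    def __init__(self, n_chunks: int):
        self.n = n_chunks
        self.next_unevaluated = 0

    def active(self):
        """Which chunk buttons are currently enabled."""
        return [i == self.next_unevaluated for i in range(self.n)]

    def evaluate(self, i: int):
        """Evaluating the active chunk advances the gate by one."""
        if i != self.next_unevaluated:
            raise ValueError("chunks must be evaluated in order")
        self.next_unevaluated += 1

gate = ChunkGate(3)
print(gate.active())    # only chunk 0 is enabled
gate.evaluate(0)
print(gate.active())    # now only chunk 1 is enabled
```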

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)
## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+      diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival dis-tribution for a single group or multiple groups Itproduces a survfit object with a plot() methodFigure 1 illustrates some of the options for producingmore attractive graphs
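Behind survfit() for a single group lies the Kaplan-Meier product-limit estimator. A small illustrative Python sketch of that estimator (not the survival package's implementation, and ignoring ties between deaths and censorings at the same time other than by the usual convention that censorings at t remain at risk at t):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate S(t) at each distinct event time.
    events[i] is 1 if the i-th time is a death, 0 if censored."""
    data = sorted(zip(times, events))
    s, curve = 1.0, []
    for t in sorted({t for t, e in data if e == 1}):
        at_risk = sum(1 for ti, _ in data if ti >= t)   # still under observation
        d = sum(1 for ti, ei in data if ti == t and ei == 1)
        s *= 1 - d / at_risk                            # product-limit step
        curve.append((t, s))
    return curve

# 5 subjects: deaths at 2, 3, 5; censored at 4+ and 6+.
curve = kaplan_meier([2, 3, 4, 5, 6], [1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Note how the censored subject at 4+ still counts in the risk set at time 3 but not at time 5; handling this partial observation correctly is the whole point of the estimator.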

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation", xscale = 365,
       ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))

> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928

R News ISSN 1609-3631

Vol. 4/1, June 2004

[Figure: two survival curves, labelled Standard and New; x-axis "Years since randomisation" (0 to 2.5), y-axis "% surviving" (0 to 100).]

Figure 1: Survival distributions for two lung cancer treatments.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
                ratetable(edtrt = edtrt, bili = bili,
                          platelet = platelet, age = age,
                          protime = protime),
                data = pbc, subset = trt == -9,
                ratetable = mayomodel, cohort = TRUE),
        col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

[Figure: observed (black) and predicted (purple) survival curves; x-axis 0 to 4000 days, y-axis survival probability 0.0 to 1.0.]

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The


cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure: smoothed estimate of Beta(t) for edtrt with pointwise bands; x-axis "Time" (ticks at 210, 1200, 2400, 3400), y-axis "Beta(t) for edtrt" (-5 to 10).]

Figure 3: Testing proportional hazards.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004: The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and from finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full-page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word 'usually' was employed above.
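A minimal sketch of the origin arithmetic described above (the serial numbers are hypothetical Excel values, invented for illustration; a real spreadsheet may use either origin, as noted):

```r
## hypothetical Excel serial datetimes (days + fraction of day)
x <- c(38138, 38169.5)

## Windows Excel origin (usually 1899-12-30)
win_dates <- as.Date("1899-12-30") + floor(x)

## Mac Excel origin (usually 1904-01-01)
mac_dates <- as.Date("1904-01-01") + floor(x)
```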

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
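As a base-R sketch of the regular case (quarterly data; the irregular classes mentioned above follow the same idea but index each observation with an explicit date):

```r
## a quarterly regular series: ts labels points by start and
## frequency, not by a date class
x <- ts(c(10, 12, 11, 13), start = c(2004, 1), frequency = 4)
as.numeric(time(x))   ## 2004.00 2004.25 2004.50 2004.75
```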

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

## default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

## if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9   ## days since 01/01/60
## chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, the day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
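The character-first advice in the POSIXlt and conversion items above can be sketched as follows (a toy timestamp, invented for illustration; both routes should agree in this simple case, but the indirect route is the safer habit):

```r
dp <- as.POSIXct("2004-06-01 12:00:00")

lt_direct <- as.POSIXlt(dp)          ## direct conversion
lt_safe   <- as.POSIXlt(format(dp))  ## via character, safer

## both represent the same wall-clock time here
format(lt_direct) == format(lt_safe)
```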

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT days)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day")  [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2]  [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2]  [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

to chron
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]

to POSIXct
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

to chron
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

to POSIXct
  Date:    as.POSIXct(format(dt), tz="GMT")
  chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
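This can be checked directly on simulated data (values invented for illustration): the coordinates built by cumulative sums are necessarily non-decreasing.

```r
set.seed(42)
D <- rbinom(200, 1, 0.4)    ## disease indicator
T <- rnorm(200, mean = D)   ## test tends to be higher in the diseased

DD <- table(-T, D)
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])  ## true positive rate
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])  ## false positive rate

stopifnot(all(diff(sens) >= 0), all(diff(mspec) >= 0))
```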

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
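The "good style" suggested above can be written as a small defensive idiom; this is a sketch rather than code from the article, using only functions from the methods package:

```r
library(methods)
library(graphics)

# In a fresh session lines() is an S3 generic only:
isGeneric("lines")        # FALSE

# Make the promotion to an S4 generic explicit, rather than letting
# a later setMethod() call create the generic as a side effect:
if (!isGeneric("lines"))
  setGeneric("lines")     # existing lines() becomes the default method

isGeneric("lines")        # TRUE
```

After this, S3 dispatch for lines continues to work; methods("lines") still lists the S3 methods.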

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model is a linear model, more that it has a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[ and [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
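A quick illustration of these semantics (variable names are invented for the example):

```r
e <- new.env()
e$x <- 1                 # same effect as assign("x", 1, envir = e)
e[["y"]] <- 2            # character index, like get/assign
e$x + e[["y"]]           # 3

e$xlong <- 10
e$xl                     # NULL: no partial matching, unlike lists
```

Note the contrast with lists, where l$xl would have matched l$xlong partially.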

• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc., now also differentiate asin(), acos() and atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
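A small sketch of the new class (the specific dates are illustrative only):

```r
d <- as.Date("2004-06-01")
class(d)                          # "Date"
d + 30                            # arithmetic is in whole days: "2004-07-01"
seq(d, by = "week", length.out = 3)  # date sequences, as for date-times
format(d, "%d %B %Y")             # formatting, with locale-dependent names
```

Because there is no time-of-day component, comparisons and differences avoid the timezone pitfalls of "POSIXct".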

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
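These definitions can be checked directly; the log version is the one to use when the factorial itself would overflow a double:

```r
factorial(5)                     # gamma(6) = 120, exact for small integers

# log(100!) computed without forming 100! itself:
all.equal(lfactorial(100), sum(log(1:100)))

# gamma(x+1) overflows just past x = 170; lgamma(x+1) stays finite:
is.finite(factorial(170))                    # TRUE
is.finite(suppressWarnings(factorial(171)))  # FALSE (Inf)
is.finite(lfactorial(171))                   # TRUE
```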

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.

• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
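For example (an illustrative sketch; the variable names are arbitrary):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)

# One call instead of a loop over get():
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```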

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
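The new S-compatible behaviours can be seen directly:

```r
# 0-length input with length.out now yields a vector of NAs, not length 0:
length(rep(numeric(0), length.out = 3))   # 3

# each + length.out recycles instead of padding with NAs:
rep(1:2, each = 2, length.out = 5)        # 1 1 2 2 1

# times is ignored when length.out is also given:
rep(1:2, times = 3, length.out = 4)       # 1 2 1 2
```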

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
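A short sketch (the file name is hypothetical):

```r
f <- "my file.txt"
shQuote(f)                      # "'my file.txt'" (Bourne-shell quoting)
paste("ls -l", shQuote(f))      # now safe to hand to system()
shQuote("don't")                # embedded single quotes are handled too
```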

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
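The fixed argument avoids having to escape regexp metacharacters; for example:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]    # "a" "b" "c": "." taken literally
strsplit("a.b.c", "\\.")[[1]]                # regexp alternative, same result
strsplit("a1b2c", "[0-9]", perl = TRUE)[[1]] # perl-style regexps also work
```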

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add a class "loadings" to their loadings component.

• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g., row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

    current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

    See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  – Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    - grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    - grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    - the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    - grobs now save()/load() like any normal R object.

    - many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    - there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    - there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    - there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

      "on": clip to the viewport (was TRUE)
      "inherit": clip to whatever parent says (was FALSE)
      "off": turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file, when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1 and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods in-cluding support vector machines By Alexan-dros Karatzoglou Alex Smola Achim ZeileisKurt Hornik

klaR Miscellaneous functions for classification andvisualization developed at the Department ofStatistics University of Dortmund By Chris-tian Roever Nils Raabe Karsten Luebke UweLigges

knncat This program scales categorical variables insuch a way as to make NN classification as ac-curate as possible It also handles continuousvariables and prior probabilities and does in-telligent variable selection and estimation of er-ror rates and the right number of NNrsquos By SamButtrey

kza Kolmogorov-Zurbenko Adpative filter for locat-ing change points in a time series By BrianClose with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models througha computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider2001) the procedure is of particular interest forhigh-dimensional data without missing valuessuch as geophysical fields By S M Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal/external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson. R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Gyömrffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



[Figure: "EWMA Chart for diameter[1:25,] and diameter[26:40,]"; x-axis: Group (1-40), y-axis: Group Summary Statistics. Calibration data in diameter[1:25,], new data in diameter[26:40,]. Number of groups = 40, Target = 74.00118, StdDev = 0.009887547, Smoothing parameter = 0.2, Control limits at 3*sigma.]

Figure 5: EWMA chart

Process capability analysis

Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type chart, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125    Target = 74
Center = 74.00118      LSL = 73.95
StdDev = 0.009887547   USL = 74.05

Capability indices:

       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%

The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed on the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
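As a quick sanity check, the point estimates of these indices can be reproduced from their textbook definitions using only the summary values printed above. The following base R sketch is our own illustration, not part of the qcc package:

```r
# Reproduce the capability index point estimates from their
# standard definitions, using the Center, StdDev and
# specification limits reported by process.capability().
center <- 74.00118
sigma  <- 0.009887547
LSL    <- 73.95
USL    <- 74.05

Cp   <- (USL - LSL) / (6 * sigma)     # potential capability
Cp_l <- (center - LSL) / (3 * sigma)  # lower capability
Cp_u <- (USL - center) / (3 * sigma)  # upper capability
Cp_k <- min(Cp_l, Cp_u)               # actual capability

round(c(Cp = Cp, Cp_l = Cp_l, Cp_u = Cp_u, Cp_k = Cp_k), 3)
#    Cp  Cp_l  Cp_u  Cp_k
# 1.686 1.725 1.646 1.646
```

The rounded values agree with the Value column of the output above.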

[Figure: "Process Capability Analysis for diameter[1:25,]" — histogram of the data with vertical lines at LSL, Target and USL. Number of obs = 125, Center = 74.00118, StdDev = 0.009887547; Target = 74, LSL = 73.95, USL = 74.05; Cp = 1.69, Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67; Exp<LSL 0%, Exp>USL 0%, Obs<LSL 0%, Obs>USL 0%.]

Figure 6: Process capability analysis

A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q–Q plot

• a capability plot

As an example we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)

Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).

[Figure: "Pareto Chart for defect" — bars in decreasing order for contact num., price code, supplier code, part num. and schedule date; left axis: Frequency (0–300), right axis: Cumulative Percentage (0–100%), with the cumulative-sum line overlaid.]

Figure 7: Pareto chart

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")

[Figure: Cause-and-Effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).]

Figure 8: Cause-and-effect diagram

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values, otherwise it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) { lcl <- -conf
                   ucl <- +conf }
  else
  { if (conf > 0 & conf < 1)
    { nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas }
    else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

[Figure: left panel, "p Chart for x" with variable LCL and UCL; Number of groups = 15, Center = 0.1931429, StdDev = 0.3947641, number beyond limits = 0, number violating runs = 0. Right panel, "p.std Chart for x" with LCL = -3 and UCL = 3; Number of groups = 15, Center = 0, StdDev = 1, number beyond limits = 0, number violating runs = 0.]

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel)

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
[email protected]

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²                    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹X′y                              (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
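The first two rules can be illustrated on a small simulated example. The sketch below uses arbitrary random data (not the data set used in the timings that follow) to confirm that the recommended form gives the same coefficients as the naive formula:

```r
# Compare the naive textbook formula with the form recommended
# by rules 1 and 2; both give the same least squares coefficients.
set.seed(123)
X <- matrix(rnorm(200 * 10), 200, 10)
y <- rnorm(200)

b.naive <- solve(t(X) %*% X) %*% t(X) %*% y      # invert, then multiply
b.cprod <- solve(crossprod(X), crossprod(X, y))  # solve one system

all.equal(b.naive, b.cprod)
# [1] TRUE
```

The second form also runs faster, for the reasons discussed in the timings below.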

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward

> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS) such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice; once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
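A short check on arbitrary random matrices (our own sketch, not from the timings above) confirms that the two expressions compute the same product, so the choice between them is purely a matter of speed:

```r
# t(A) %*% B and crossprod(A, B) both compute A'B;
# crossprod avoids forming the transpose of A explicitly.
set.seed(1)
A <- matrix(rnorm(1000 * 50), 1000, 50)
B <- matrix(rnorm(1000 * 40), 1000, 40)

all.equal(t(A) %*% B, crossprod(A, B))
# [1] TRUE
```

On larger matrices, timing the two forms (e.g. with system.time) shows the crossprod version ahead, for the memory-access reasons described above.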

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
[email protected]

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

bull Typing in a valid path to a local directory con-taining R packages and then press the Enterkey will update the name of the package be-ing explored and show the contents of the firstR package in the path in the Contents list box

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.

R News ISSN 1609-3631

Vol. 4/1, June 2004

Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget appears, as shown by Figure 4, that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died and 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))

> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (x-axis: years since randomisation; y-axis: % surviving; curves labelled "Standard" and "New").

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t, z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t, z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
         ratetable(edtrt = edtrt, bili = bili,
                   platelet = platelet, age = age,
                   protime = protime),
         data = pbc, subset = trt == -9,
         ratetable = mayomodel, cohort = TRUE),
        col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards (x-axis: time; y-axis: β(t) for edtrt).

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
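As a hedged sketch (not from the article) of how such terms might appear in a formula, the data frame df and its columns (time, status, x, id, centre) below are invented for illustration:

  ## Hypothetical use of cluster() and strata() in a coxph() formula:
  ## records sharing an id belong to one individual, and each centre
  ## gets its own baseline hazard.
  library(survival)
  fit <- coxph(Surv(time, status) ~ x + cluster(id) + strata(centre),
               data = df)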

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than datetime packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
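The internal representations described above can be inspected directly with unclass; a minimal sketch (assuming the chron package is installed):

  library(chron)

  d  <- as.Date("1970-01-02")              ## Date: days since 1970-01-01
  dc <- chron("01/02/1970", "12:00:00")    ## chron: days plus fraction of a day
  dp <- as.POSIXct("1970-01-02 00:00:00",  ## POSIXct: seconds since
                   tz = "GMT")             ##   1970-01-01 GMT

  unclass(d)       ## 1
  as.numeric(dc)   ## 1.5, i.e. noon on January 2nd, 1970
  unclass(dp)      ## 86400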

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
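As a concrete sketch of the Windows Excel conversion just described (the serial numbers here are made up for illustration):

  ## x: days and fraction of days since 1899-12-30 (Windows Excel)
  x <- c(38001.5, 38002.25)

  as.Date("1899-12-30") + floor(x)   ## Date class dates

  library(chron)
  chron("12/30/1899") + x            ## chron datetimes, times retained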

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

## default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

## if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  ## days since 01/01/60
## chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

  dp <- seq(Sys.time(), len = 10, by = "day")
  plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
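For example, the character-first recipe for a POSIXct variable dp looks like this (a minimal sketch; the chron package is assumed to be installed):

  ## Convert POSIXct to Date and chron via character strings, so the
  ## printed (local time zone) representation is what gets converted.
  library(chron)
  dp <- as.POSIXct("2004-06-01 12:00:00")

  as.Date(format(dp))                 ## Date
  chron(format(dp, "%m/%d/%Y"),       ## chron
        format(dp, "%H:%M:%S"))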

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


Table 1: Comparison Table

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x GMT days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")   [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz = "GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]   [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]   [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]   [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))   [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month (which sorts)
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

to chron
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))   [6]

to POSIXct
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

to chron
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

to POSIXct
  Date:    as.POSIXct(format(dt), tz = "GMT")
  chron:   as.POSIXct(dc)

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 − specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c])/sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
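This can be checked directly on a small example (an illustration, not from the article; the data here are made up):

```r
## Both coordinates of the curve are cumulative sums of
## non-negative counts, so they can never decrease.
T <- c(3, 1, 4, 1, 5, 9, 2, 6)          # made-up test results
D <- c(1, 0, 1, 0, 1, 1, 0, 1)          # disease indicator
DD <- table(-T, D)
sens <- cumsum(DD[, 2]) / sum(DD[, 2])  # true positive rate
mspec <- cumsum(DD[, 1]) / sum(DD[, 1]) # false positive rate
is.unsorted(sens)   # FALSE: monotone non-decreasing
is.unsorted(mspec)  # FALSE
```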

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec, test = TT,
        call = sys.call())
    class(rval) <- "ROC"
    rval
}

print.ROC <- function(x, ...) {
    cat("ROC curve: ")
    print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
    xlab = "1-Specificity", ylab = "Sensitivity",
    main = NULL, ...) {
    par(pty = "s")
    plot(x$mspec, x$sens, type = type,
        xlab = xlab, ylab = ylab, ...)
    if (null.line)
        abline(0, 1, lty = 3)
    if (is.null(main))
        main <- x$call
    title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method.

lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve.

identify.ROC <- function(x, labels = NULL, ...,
    digits = 1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
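One possible solution (my sketch, not part of the article) applies the trapezoidal rule to the stored (1 − specificity, sensitivity) points, padded with the (0, 0) corner; the curve as constructed already ends at (1, 1):

```r
## Trapezoidal-rule AUC for the S3 "ROC" objects built above.
AUC <- function(x, ...) UseMethod("AUC")

AUC.ROC <- function(x, ...) {
    mspec <- c(0, x$mspec)  # pad with the (0, 0) corner
    sens <- c(0, x$sens)
    n <- length(sens)
    sum(diff(mspec) * (sens[-1] + sens[-n]) / 2)
}
```

For a test with perfect separation the area is 1; for a useless test it is close to 0.5.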

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
        test = "numeric", call = "call"),
    validity = function(object) {
        length(object@sens) == length(object@mspec) &&
            length(object@sens) == length(object@test)
    })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
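A stricter validity function along those lines might look as follows (a sketch, not from the article; the class name "ROC2" is used here only to avoid clashing with the definition above):

```r
library(methods)

## Sketch: validity check that also enforces the [0, 1] range and
## monotone ordering of the rates, returning a message on failure.
setClass("ROC2",
    representation(sens = "numeric", mspec = "numeric",
        test = "numeric", call = "call"),
    validity = function(object) {
        if (length(object@sens) != length(object@mspec) ||
            length(object@sens) != length(object@test))
            return("sens, mspec and test must have equal lengths")
        if (any(object@sens < 0 | object@sens > 1 |
                object@mspec < 0 | object@mspec > 1))
            return("rates must lie in [0, 1]")
        if (is.unsorted(object@sens) || is.unsorted(object@mspec))
            return("rates must be non-decreasing")
        TRUE
    })
```

With this definition, new() rejects, for example, a sens vector that decreases.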

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code.

setMethod("show", "ROC",
    function(object) {
        cat("ROC curve: ")
        print(object@call)
    })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
    function(x, y, type = "b", null.line = TRUE,
        xlab = "1-Specificity",
        ylab = "Sensitivity",
        main = NULL, ...) {
        par(pty = "s")
        plot(x@mspec, x@sens, type = type,
            xlab = xlab, ylab = ylab, ...)
        if (null.line)
            abline(0, 1, lty = 3)
        if (is.null(main))
            main <- x@call
        title(main = main)
    })

Creating a method for lines appears to work the same way.

setMethod("lines", "ROC",
    function(x, ...) {
        lines(x@mspec, x@sens, ...)
    })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T) {
        if (!(T$family$family %in%
            c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")
        test <- fitted(T)
        disease <- (test + resid(T, type = "response"))
        disease <- disease * weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[, 2]) / sum(DD[, 2])
        mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
        new("ROC", sens = sens, mspec = mspec,
            test = TT, call = sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
    function(x, labels = NULL, ...,
        digits = 1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
            labels = labels, ...)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
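To make the "special case" idea concrete, here is a minimal sketch of S4 inheritance via contains= (not from the article; the "smoothROC" class and its bandwidth slot are hypothetical):

```r
library(methods)

## The base class, as defined earlier.
setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
        test = "numeric", call = "call"))

## A hypothetical special case: a smoothed ROC curve that also
## records the bandwidth used.  Methods written for "ROC"
## (show, plot, lines, ...) apply to it automatically.
setClass("smoothROC", contains = "ROC",
    representation(bandwidth = "numeric"))

s <- new("smoothROC", sens = c(0.5, 1), mspec = c(0, 1),
    test = c(2, 1), call = quote(ROC(T, D)), bandwidth = 0.25)
is(s, "ROC")  # TRUE: a smoothROC is a ROC
```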

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular "Error()" models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g., row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spa-tial clusters of disease using count data Boot-strap is used to estimate sampling distributionsof statistics By Virgilio Goacutemez-Rubio JuanFerraacutendiz Antonio Loacutepez

GeneTS A package for analyzing multiple gene ex-pression time series data Currently GeneTSimplements methods for cell cycle analy-sis and for inferring large sparse graphi-cal Gaussian models For plotting the in-ferred genetic networks GeneTS requires thegraph and Rgraphviz packages (availablefrom wwwbioconductororg) By Konstanti-nos Fokianos Juliane Schaefer and KorbinianStrimmer

HighProbability Provides a simple fast reliable so-lution to the multiple testing problem Given avector of p-values or achieved significance lev-els computed using standard frequentist infer-ence HighProbability determines which onesare low enough that their alternative hypothe-ses can be considered highly probable The p-value vector may be determined using exist-ing R functions such as ttest wilcoxtestcortest or sample HighProbability can beused to detect differential gene expression andto solve other problems involving a large num-ber of hypothesis tests By David R Bickel

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, and David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard unit testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, and Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which in particular allows parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather and Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular statistics, from "Topics in Circular Statistics" (2001) by S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund and Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object-oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, and Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, and Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 groups) classification. By Beiying Ding.

hapassoc R functions for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, and J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP request protocols. Implements the GET, POST and multipart POST requests. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, and Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, and Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NNs. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., and Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley; some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just as in generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and a backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward way to specify random effects models, including random interactions. By David Clifford and Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova and Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley and Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld D. (2001), "A Simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-PLUS times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state and steady-state ARL (Average Run Length). Setting up control charts for a given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis and Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia), Javier de las Rivas (Spain), Sandrine Dudoit (USA), Werner Engl (Austria), Balazs Györffy (Germany), Branimir K. Hackenberger (Croatia), O. Maurice Haynes (USA), Kai Hendry (UK), Arne Henningsen (Germany), Wolfgang Huber (Germany), Marjatta Kari (Finland), Osmo Kari (Finland), Peter Lauf (Germany), Andy Liaw (USA), Mu-Yen Lin (Taiwan), Jeffrey Lins (Denmark), James W. MacDonald (USA), Marlene Müller (Germany), Gillian Raab (UK), Roland Rau (Germany), Michael G. Schimek (Austria), R. Woodrow Setzer (USA), Thomas Telkamp (Netherlands), Klaus Thul (Taiwan), Shusaku Tsumoto (Japan), Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor, Programmer's Niche: Bill Venables

Editor, Help Desk: Uwe Ligges

Email of editors and editorial board: [email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, and all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631


and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simplest form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)

The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axis labels and limits, the main title, and bar colors (see help(pareto.chart)).

Figure 7: Pareto chart.

A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize the possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the font used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.

> cause.and.effect(
+   cause = list(Measurements = c("Microscopes",
+                                 "Inspectors"),
+                Materials = c("Alloys",
+                              "Suppliers"),
+                Personnel = c("Supervisors",
+                              "Operators"),
+                Environment = c("Condensation",
+                                "Moisture"),
+                Methods = c("Brake", "Engager",
+                            "Angle"),
+                Machines = c("Speed", "Bits",
+                             "Sockets")),
+   effect = "Surface Flaws")

Figure 8: Cause-and-effect diagram.

Tuning and extending the package

Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user's point of view, the most interesting options allow setting: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).

Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar) /
       sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev,
                         sizes, conf)
{
  if (conf >= 1)
  { lcl <- -conf
    ucl <- +conf }
  else
  { if (conf > 0 & conf < 1)
    { nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas }
    else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
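As a quick sanity check of the standardization, the z-score formula can be recomputed in plain base R, independently of qcc (the helper name p.std.scores below is invented for this check): when every group shows exactly the overall proportion, all scores are zero, consistent with the chart's center line at 0.

```r
# Re-implementation of the z-score formula used by stats.p.std
p.std.scores <- function(data, sizes) {
  pbar <- sum(data)/sum(sizes)
  (data/sizes - pbar)/sqrt(pbar*(1 - pbar)/sizes)
}

# If every group shows exactly the overall proportion,
# all z scores are 0
sizes <- c(50, 100, 25)
data  <- c(10, 20, 5)      # proportion 0.2 in each group
z <- p.std.scores(data, sizes)

# With unequal observed proportions the scores are scale-free,
# which is what makes the fixed +/- 3 sigma limits applicable
# across unequal group sizes
z2 <- p.std.scores(c(15, 20, 5), sizes)
```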

Summary

In this paper we have briefly described the qcc package. It provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.


Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
[email protected]

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
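The first two principles can be checked on a small simulated example in base R (the object names below are arbitrary): the direct solve of the system and the explicit-inverse form give the same coefficients, and crossprod agrees with the products it replaces.

```r
set.seed(42)
X <- matrix(rnorm(200 * 5), nrow = 200)   # 200 x 5 model matrix
y <- X %*% c(1, -2, 0.5, 3, -1) + rnorm(200)

A <- crossprod(X)        # X'X, principle 2 (instead of t(X) %*% X)
b <- crossprod(X, y)     # X'y

beta.solve  <- solve(A, b)       # principle 1: solve the system
beta.invert <- solve(A) %*% b    # explicit inverse: what to avoid

# Same answer, but solve(A, b) skips forming the inverse
all.equal(as.vector(beta.solve), as.vector(beta.invert))
# crossprod agrees with the naive product it replaces
all.equal(A, t(X) %*% X)
```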

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:

> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:

> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
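The same chol/backsolve/forwardsolve idiom can be tried on a small simulated problem in base R (object names invented), confirming that it reproduces the naive solution of (2):

```r
set.seed(1)
X <- matrix(rnorm(100 * 4), nrow = 100)
y <- rnorm(100)

ch <- chol(crossprod(X))   # upper triangular R with R'R = X'X
# Solve R'(R b) = X'y in two triangular steps
beta.chol <- backsolve(ch,
    forwardsolve(ch, crossprod(X, y),
                 upper = TRUE, trans = TRUE))

beta.naive <- solve(t(X) %*% X) %*% t(X) %*% y
all.equal(as.vector(beta.chol), as.vector(beta.naive))
```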


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object, so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed sparse column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric sparse column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A'B are generally faster than those in AB.
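The equivalence being exploited here is easy to check numerically (a minimal sketch; the matrix sizes below are illustrative and not taken from the article):

```r
# crossprod(A, B) computes t(A) %*% B in one pass, without ever
# materializing t(A); the two results agree to numerical precision.
set.seed(1)
A <- matrix(rnorm(200 * 5), 200, 5)   # illustrative sizes
B <- matrix(rnorm(200 * 3), 200, 3)
stopifnot(isTRUE(all.equal(t(A) %*% B, crossprod(A, B))))
```

On large matrices the crossprod form is also the faster of the two, for the memory-access reasons just described.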

Notice that this means that calculating A'B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice, once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices – yet another reason to use crossprod.

R News ISSN 1609-3631
Vol. 4/1, June 2004

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D. M. Bates
U. of Wisconsin-Madison
[email protected]

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").
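Outside the widget, the listing that pExplorer presents can be approximated with a few lines of base R (a rough sketch of the default behaviour described above; pExplorer itself adds the Tk interface and navigation):

```r
# First package in the first library path, with the "Meta" and
# "latex" entries excluded -- the widget's default view.
lib <- .libPaths()[1]
pkg <- dir(lib)[1]
contents <- dir(file.path(lib, pkg))
contents[!contents %in% c("Meta", "latex")]
```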

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))

 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30+time, status))

 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation", xscale = 365,
       ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (Standard and New), plotted as % surviving against years since randomisation.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
       log(bili) + log(protime) + age + platelet,
       data = pbc, subset = trt > 0)
> mayomodel

Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) +
      age + platelet, data = pbc,
      subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
       data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
       ratetable(edtrt = edtrt, bili = bili,
                 platelet = platelet, age = age,
                 protime = protime),
       data = pbc, subset = trt == -9,
       ratetable = mayomodel, cohort = TRUE),
       col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)

                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards: smoothed estimate of Beta(t) for edtrt plotted against time.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
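A stratified fit on the veteran data might look like the following (a minimal sketch; the choice of celltype as the stratifying variable is purely illustrative):

```r
library(survival)
# Each cell type gets its own unspecified baseline hazard;
# only the age effect is estimated across strata.
# (veteran is lazy-loaded in current versions of survival;
# older versions needed data(veteran) first.)
fit <- coxph(Surv(time, status) ~ age + strata(celltype),
             data = veteran)
coef(fit)["age"]
```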

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
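A minimal survreg sketch on the same data (the Weibull distribution and the single predictor here are illustrative choices, not from the article; see ?survreg for the available error distributions):

```r
library(survival)
# Accelerated-failure-time model for log survival time with
# Weibull errors, using the veteran data from the text.
wfit <- survreg(Surv(time, status) ~ trt, data = veteran,
                dist = "weibull")
coef(wfit)
```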

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference — useR! 2004 — which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations — from bioinformatics to user interfaces, and finance to fights among lizards — reflects the wide range of applications supported by R.

A particularly interesting component of the program was the set of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference — the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

John Fox
McMaster University, Canada
[email protected]

R Help Desk
Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than datetime packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
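The internal representations described above are easy to verify with unclass (a base-R-only sketch; chron, being a contributed package, is omitted):

```r
# Date: days since 1970-01-01
d <- as.Date("1970-01-02")
stopifnot(unclass(d) == 1)
# POSIXct: seconds since 1970-01-01 00:00:00 GMT
p <- as.POSIXct("1970-01-01 00:00:10", tz = "GMT")
stopifnot(unclass(p) == 10)
# POSIXlt: a list of components (sec, min, hour, mday, mon, year, ...)
names(unclass(as.POSIXlt(p)))
```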

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
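A worked instance of the Windows-origin conversion (assuming the usual 1899-12-30 origin; the serial number 36526 is, on that origin, Excel's representation of January 1, 2000):

```r
x <- 36526.75                     # an Excel (Windows) date-time serial
d <- as.Date("1899-12-30") + floor(x)
format(d)                         # "2000-01-01"
```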

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table. [1]

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

  dp <- seq(Sys.time(), len = 10, by = "day")
  plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames — POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
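The convert-to-character-first advice can be illustrated as follows (a minimal sketch; the particular time and zone are chosen for the example):

```r
p <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")
# Going through character, as recommended, preserves the clock time
# and leaves isdst as expected:
lt <- as.POSIXlt(format(p, tz = "GMT"), tz = "GMT")
stopifnot(lt$hour == 12, lt$isdst == 0)
```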

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison table (columns: Date, chron, POSIXct)

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month (which sorts)
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  chron -> Date:     as.Date(dates(dc))
  POSIXct -> Date:   as.Date(format(dp))
  Date -> chron:     chron(unclass(dt))
  POSIXct -> chron:  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  Date -> POSIXct:   as.POSIXct(format(dt))
  chron -> POSIXct:  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  chron -> Date:     as.Date(dates(dc))
  POSIXct -> Date:   as.Date(dp)
  Date -> chron:     chron(unclass(dt))
  POSIXct -> chron:  as.chron(dp)
  Date -> POSIXct:   as.POSIXct(format(dt), tz="GMT")
  chron -> POSIXct:  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
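A few of the Date-column idioms in the table can be checked directly at the console. This is a small sketch; the scalar date is invented for illustration and follows the table's naming convention (dt0 is a scalar Date):

```r
dt0 <- as.Date("2004-06-15")  # invented scalar Date, as in the table's conventions

# next day
stopifnot(format(dt0 + 1) == "2004-06-16")

# first day of month
stopifnot(format(as.Date(format(dt0, "%Y-%m-01"))) == "2004-06-01")

# day of week (Sun=0): 2004-06-15 fell on a Tuesday
stopifnot(as.numeric(format(dt0, "%w")) == 2)
```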


Programmers' Niche: A simple class, in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1-specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
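The point can be seen on a toy example (the data are invented for illustration): the cumulative sums produced by the tabulation are necessarily non-decreasing.

```r
# Invented toy data: test results T and disease indicator D
T <- c(1, 2, 3, 4, 5, 6)
D <- c(0, 0, 1, 0, 1, 1)

# Tabulate scores (negated so larger scores come first) against disease
DD <- table(-T, D)

# Cumulative true-positive and false-positive rates
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])

# Both rates are non-decreasing, so the ROC curve is monotone
stopifnot(!is.unsorted(sens), !is.unsorted(mspec))
```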

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
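For readers who want to check their answer to that exercise, one possible sketch (not from the article) integrates the curve by the trapezoidal rule, assuming the sens and mspec components produced by the ROC() constructor above and prepending the (0, 0) endpoint:

```r
# Hypothetical AUC helper; assumes the list structure built by ROC() above.
AUC <- function(x) {
  mspec <- c(0, x$mspec)  # prepend the (0, 0) endpoint
  sens  <- c(0, x$sens)
  # trapezoidal rule: strip widths times average heights
  sum(diff(mspec) * (sens[-length(sens)] + sens[-1]) / 2)
}
```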

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components ("slots") that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
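As a minimal illustration of slot access (the class name "pt" is invented for this sketch, not part of the article):

```r
library(methods)  # the S4 system

setClass("pt", representation(x = "numeric"))
p <- new("pt", x = c(1, 2, 3))

# @ reads a slot, much as $ reads a list component
stopifnot(identical(p@x, c(1, 2, 3)))
```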

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
  )

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens, labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
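For example, a quick sketch of the new environment subsetting:

```r
e <- new.env()
e$x <- 1        # $<- on an environment
e[["y"]] <- 2   # [[<- on an environment

# semantics are those of get()/assign() with inherits=FALSE
stopifnot(e$x == 1, e[["y"]] == 2,
          identical(sort(ls(e)), c("x", "y")))
```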

• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
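For example:

```r
library(utils)  # head() and tail() live in package 'utils'

stopifnot(identical(head(letters, 3), c("a", "b", "c")))
stopifnot(identical(tail(1:10, 2), 9:10))
```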

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

  If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
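A sketch of the new rep() compatibility behaviour:

```r
# 0-length input with length.out=n now gives length n (filled with NA)
stopifnot(length(rep(integer(0), length.out = 3)) == 3)

# each= plus length.out=: recycle rather than pad with NA
stopifnot(identical(rep(1:2, each = 2, length.out = 3), c(1L, 1L, 2L)))

# times= is ignored when length.out= is also given
stopifnot(identical(rep(1:2, times = 5, length.out = 4), c(1L, 2L, 1L, 2L)))
```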

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.
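For example, fixed=TRUE treats the split string literally rather than as a regular expression:

```r
# "." as a literal separator
stopifnot(identical(strsplit("a.b.c", ".", fixed = TRUE)[[1]],
                    c("a", "b", "c")))
# without fixed=TRUE, "." is a regular expression matching any character
```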

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

    current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

    See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


  - Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies).

    This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend().

    There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  - Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually, a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  - Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  - Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

      "on"      clip to the viewport (was TRUE)
      "inherit" clip to whatever parent says (was FALSE)
      "off"     turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1 and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
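Two of the changes above, the replacement of package.description() by packageDescription() and the deprecation of tetragamma()/pentagamma() in favour of psigamma(), can be tried directly at the prompt. The snippet below is an illustrative sketch only (the psigamma value shown is approximate):

```r
# packageDescription() returns the DESCRIPTION metadata of an
# installed package; the $Package field gives the package name.
desc <- packageDescription("stats")
desc$Package                 # "stats"

# tetragamma(x) is now spelled psigamma(x, deriv = 2),
# i.e. the second derivative of the digamma function.
psigamma(2, deriv = 2)       # approximately -0.4041
```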

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple algorithm for designing group sequential clinical trials", Biometrics, 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]

R News ISSN 1609-3631

Vol 41 June 2004 46

R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

bull Emanuele De Rinaldis (Italy)

bull Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:

stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) {
    lcl <- -conf
    ucl <- +conf
  }
  else {
    if (conf > 0 & conf < 1) {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid conf argument.")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}

Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)

Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).

Summary

In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.

References

Montgomery, D.C. (2000). Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991). Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
[email protected]

Least Squares Calculations in R

Timing different approaches

by Douglas Bates

Introduction

The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra, how calculations with matrices can be carried out accurately and efficiently, is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.

In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = argmin_β ‖y − Xβ‖²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
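These three principles can be checked on a small simulated problem. The sketch below is illustrative only: the matrix X, response y, and the beta.* names are made up for this example, not objects from the article.

```r
set.seed(123)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
y <- rnorm(100)

# Principles 1 and 2: a single solve() call on crossprod() results
beta.cpod <- solve(crossprod(X), crossprod(X, y))

# The literal textbook formula (2), for comparison
beta.naive <- solve(t(X) %*% X) %*% t(X) %*% y

# Principle 3: use the Cholesky factor R of X'X, so that
# X'X = R'R; forwardsolve() handles R'z = X'y, backsolve() R b = z
R <- chol(crossprod(X))
beta.chol <- backsolve(R, forwardsolve(R, crossprod(X, y),
                                       upper.tri = TRUE,
                                       transpose = TRUE))

all.equal(as.vector(beta.naive), as.vector(beta.cpod))  # TRUE
all.equal(as.vector(beta.naive), as.vector(beta.chol))  # TRUE
```

All three routes agree to numerical precision; they differ only in speed and stability, as the timings below demonstrate on a much larger problem.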

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:

> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:

> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed sparse column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric sparse column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS) such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or ATLAS (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
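As a concrete illustration (not from the article), such configuration happens at build time via the --with-blas flag documented in the R Installation and Administration manual; the library name below is a placeholder that depends on where and how the BLAS was installed:

```shell
# Sketch: build R against an external optimized BLAS.
# "-lgoto" is an assumed, installation-specific library flag.
./configure --with-blas="-lgoto"
make
```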

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice; once in the calculation of t(A) and a second time in the inner loop of the matrix


product. The crossprod function does not do any transposition of matrices – yet another reason to use crossprod.
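A quick numerical check of this equivalence (an illustration added here, not part of the original article):

```r
## crossprod(A, B) computes t(A) %*% B without explicitly forming t(A).
A <- matrix(rnorm(200 * 5), nrow = 200)
B <- matrix(rnorm(200 * 5), nrow = 200)
stopifnot(isTRUE(all.equal(crossprod(A, B), t(A) %*% B)))
## The one-argument form crossprod(A) computes t(A) %*% A; it is the
## idiom used throughout this article.
stopifnot(isTRUE(all.equal(crossprod(A), t(A) %*% A)))
```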

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
[email protected]

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcl/tk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcl/tk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
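A minimal session might look as follows (a sketch, assuming the Bioconductor packages tkWidgets and widgetTools and a working tcl/tk installation):

```r
library(widgetTools)
library(tkWidgets)
pExplorer("base")   # open the package explorer on the base package
```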

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When

vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+        diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time,status)~trt, data=veteran),
+      xlab="Years since randomisation", xscale=365,
+      ylab="% surviving", yscale=100,
+      col=c("forestgreen","blue"))

> survdiff(Surv(time,status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


[Figure 1 shows the two Kaplan-Meier curves, labelled Standard and New, plotted as % surviving (0-100) against years since randomisation (0-2.5).]

Figure 1: Survival distributions for two lung cancer treatments.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time,status) ~ edtrt +
+        log(bili) + log(protime) + age + platelet,
+        data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time,status)~edtrt,
+      data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
+      ratetable(edtrt=edtrt, bili=bili,
+          platelet=platelet, age=age,
+          protime=protime),
+      data=pbc, subset=trt==-9,
+      ratetable=mayomodel, cohort=TRUE),
+      col="purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

[Figure 2 shows step-function survival curves, with survival probability (0.0-1.0) plotted against follow-up time (0-4000 days).]

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The


cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure 3 shows a smoothed estimate of Beta(t) for edtrt plotted against time (roughly 210 to 3400 days).]

Figure 3: Testing proportional hazards.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve

shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
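For instance, a parametric counterpart of the earlier comparison of the two treatments could be sketched as follows (an illustration added here, not from the article; survreg's default error distribution is Weibull):

```r
library(survival)
data(veteran)
## Parametric (Weibull) model for survival time by treatment
wfit <- survreg(Surv(time, status) ~ trt, data = veteran)
summary(wfit)
```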

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program was a series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common

interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

John Fox
McMaster University, Canada
[email protected]

R Help Desk
Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight

vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications such as world currency markets where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of

R News ISSN 1609-3631

Vol 41 June 2004 30

the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
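The internal representations described above can be inspected directly with unclass(); the following sketch (added here for illustration, assuming the chron package is installed) shows the same instant in each system:

```r
d <- as.Date("1970-01-02")
unclass(d)        # 1: days since January 1, 1970

library(chron)
dc <- chron("01/02/70", "12:00:00")
unclass(dc)       # 1.5: noon on January 2nd, 1970

dp <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")
unclass(dp)       # 86400: seconds since January 1, 1970 GMT
```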

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
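Applying the Windows-origin recipe above to a concrete serial number (a sketch added for illustration; 36526.5 is noon on 2000-01-01 under the usual 1899-12-30 origin):

```r
x <- 36526.5
as.Date("1899-12-30") + floor(x)   # Date "2000-01-01"

library(chron)
chron("12/30/1899") + x            # chron value for the same instant
```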

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e.

equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
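For example, an irregular series can be indexed by any of the datetime classes in zoo (a sketch added here, assuming the CRAN package zoo is installed; the values are made up):

```r
library(zoo)
## Three observations on irregular dates, indexed by Date
z <- zoo(c(1.2, 0.7, 1.9),
         as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
z
```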

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data.


orig <- chron("01/01/60")
x <- 0:9 # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in Table 1 of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times are ignored, so it is safer to use POSIXct than

POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames – POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Vol. 4/1, June 2004, page 32

Table 1: Comparison table (idioms for the Date, chron and POSIXct classes).

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct: format(dp, "%a %b %d, %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  from chron to Date:    as.Date(dates(dc))
  from POSIXct to Date:  as.Date(format(dp))
  from Date to chron:    chron(unclass(dt))
  from POSIXct to chron: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  from Date to POSIXct:  as.POSIXct(format(dt))
  from chron to POSIXct: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  from chron to Date:    as.Date(dates(dc))
  from POSIXct to Date:  as.Date(dp)
  from Date to chron:    chron(unclass(dt))
  from POSIXct to chron: as.chron(dp)
  from Date to POSIXct:  as.POSIXct(format(dt), tz="GMT")
  from chron to POSIXct: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for more % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if one hour deviation at time changes between standard and daylight savings time are acceptable.
[3] Can be reduced to dp-24*60*60 if one hour deviation at time changes between standard and daylight savings time are acceptable.
[4] Can be reduced to dp+24*60*60*x if one hour deviation acceptable when straddling odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison table.


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
    cutpoints <- c(-Inf, sort(unique(T)), Inf)
    sens <- sapply(cutpoints,
        function(c) sum(D[T > c])/sum(D))
    spec <- sapply(cutpoints,
        function(c) sum((1 - D)[T <= c])/sum(1 - D))
    plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}

print.ROC <- function(x, ...) {
    cat("ROC curve: ")
    print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
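As a toy illustration of this dispatch (the "duck" class and its method are invented for this example):

```r
# Any list can be given a class; print() then dispatches on it,
# looking for a function named print.<class>.
duck <- structure(list(sound = "quack"), class = "duck")
print.duck <- function(x, ...) cat("a duck says:", x$sound, "\n")
print(duck)
```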

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
        xlab = "1-Specificity", ylab = "Sensitivity",
        main = NULL, ...) {
    par(pty = "s")
    plot(x$mspec, x$sens, type = type,
        xlab = xlab, ylab = ylab, ...)
    if (null.line)
        abline(0, 1, lty = 3)
    if (is.null(main))
        main <- x$call
    title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function to compute the area under the ROC curve is left as an exercise for the reader.

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
            c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
        test = "numeric", call = "call"),
    validity = function(object) {
        length(object@sens) == length(object@mspec) &&
            length(object@sens) == length(object@test)
    })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
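To see the validity check in action, here is a self-contained sketch (it restates the class definition so it can run on its own; the tryCatch wrapper and the "invalid" label are just for illustration):

```r
library(methods)
setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
        test = "numeric", call = "call"),
    validity = function(object) {
        length(object@sens) == length(object@mspec) &&
            length(object@sens) == length(object@test)
    })

# Mismatched slot lengths make new() signal an error.
bad <- tryCatch(
    new("ROC", sens = c(0, 1), mspec = c(0, 1),
        test = numeric(0), call = quote(ROC())),
    error = function(e) "invalid")
bad
```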

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
    function(object) {
        cat("ROC curve: ")
        print(object@call)
    })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
    function(x, y, type = "b", null.line = TRUE,
            xlab = "1-Specificity",
            ylab = "Sensitivity",
            main = NULL, ...) {
        par(pty = "s")
        plot(x@mspec, x@sens, type = type,
            xlab = xlab, ylab = ylab, ...)
        if (null.line)
            abline(0, 1, lty = 3)
        if (is.null(main))
            main <- x@call
        title(main = main)
    })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
    function(x, ...) {
        lines(x@mspec, x@sens, ...)
    })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T, ...) {
        if (!(T$family$family %in%
                c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")
        test <- fitted(T)
        disease <- (test + resid(T, type = "response"))
        disease <- disease * weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[, 2])/sum(DD[, 2])
        mspec <- cumsum(DD[, 1])/sum(DD[, 1])
        new("ROC", sens = sens, mspec = mspec,
            test = TT, call = sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))
setMethod("identify", "ROC",
    function(x, labels = NULL, ..., digits = 1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
            labels = labels, ...)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4 which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.

bull varimax() and promax() add class loadingsto their loadings component


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

    current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

    See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A",
                gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"),
                gp=gpar(fill="blue"))


  - Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  - Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  - Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  - Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

      "on"      clip to the viewport (was TRUE)
      "inherit" clip to whatever parent says (was FALSE)
      "off"     turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


bull Added an option to packagedependencies()to handle the rsquoSuggestsrsquo levels of dependencies

bull Vignette dependencies can now be checkedand obtained via vignetteDepends

bull Option repositories to list URLs for pack-age repositories added

bull packagedescription() has been replaced bypackageDescription()

bull R CMD INSTALLbuild now skip Subversionrsquossvn directories as well as CVS directories

bull arraySubscript and vectorSubscript take anew argument which is a function pointer thatprovides access to character strings (such as thenames vector) rather than assuming these arepassed in

bull R_CheckUserInterrupt is now described inlsquoWriting R Extensionsrsquo and there is a newequivalent subroutine rchkusr for calling fromFORTRAN code

bull hsv2rgb and rgb2hsv are newly in the C API

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv = 2) and psigamma(x, deriv = 3).
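The replacement calls are direct; a minimal sketch (the x values here are arbitrary):

```r
x <- c(0.5, 1, 2, 10)

# tetragamma(x) is now written as the 2nd derivative of digamma:
psigamma(x, deriv = 2)

# pentagamma(x) is now written as the 3rd derivative:
psigamma(x, deriv = 3)
```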

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now '--without-zlib --without-bzlib --without-pcre'.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1 and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.

kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.

rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

• Editorial
• The Decision To Use R
  • Preface
  • Some Background on the Company
  • Evaluating The Tools of the Trade - A Value Based Equation
  • An Introduction To R
  • A Continued Evolution
  • Giving Back
  • Thanks
• The ade4 package - I: One-table methods
  • Introduction
  • The duality diagram class
  • Distance matrices
  • Taking into account groups of individuals
  • Permutation tests
  • Conclusion
• qcc: An R package for quality control charting and statistical process control
  • Introduction
  • Creating a qcc object
  • Shewhart control charts
  • Operating characteristic function
  • Cusum charts
  • EWMA charts
  • Process capability analysis
  • Pareto charts and cause-and-effect diagrams
  • Tuning and extending the package
  • Summary
  • References
• Least Squares Calculations in R
• Tools for interactively exploring R packages
  • References
• The survival package
  • Overview
  • Specifying survival data
  • One and two sample summaries
  • Proportional hazards models
  • Additional features
• useR! 2004
• R Help Desk
  • Introduction
  • Other Applications
  • Time Series
  • Input
  • Avoiding Errors
  • Comparison Table
• Programmers' Niche
  • ROC curves
  • Classes and methods
    • Generalising the class
  • S4 classes and methods
  • Discussion
• Changes in R
  • New features in 1.9.1
  • User-visible changes in 1.9.0
  • New features in 1.9.0
• Changes on CRAN
  • New contributed packages
  • Other changes
• R Foundation News
  • Donations
  • New supporting institutions
  • New supporting members
matically as

\hat{\beta} = \arg\min_{\beta} \| y - X\beta \|^2   (1)

where X is an n \times p model matrix (p \le n), y is n-dimensional and \beta is p-dimensional. Most statistics texts state that the solution to (1) is

\hat{\beta} = (X'X)^{-1} X'y   (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way, using code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.

General principles

A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X'X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X'y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X'X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
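Principles 1 and 2 can be illustrated with a small simulated example (a minimal sketch on arbitrary random data, not from the article):

```r
set.seed(1)
X <- matrix(rnorm(1000 * 50), 1000, 50)
y <- rnorm(1000)

# Literal translation of (2): explicit inverse, explicit transpose.
naive <- solve(t(X) %*% X) %*% t(X) %*% y

# Following principles 1 and 2: form X'X and X'y with crossprod()
# and solve the linear system directly, without inverting anything.
better <- solve(crossprod(X), crossprod(X, y))

all.equal(c(naive), c(better))  # the two solutions agree
```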

Some timings on a large example

For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.

> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute

> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE

Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X'X the naive way

> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward

> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE

Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS) such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A'B are generally faster than those in AB.

Notice that this means that calculating A'B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices – yet another reason to use crossprod.
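The point is easy to check on any machine with a rough timing sketch (the matrix sizes here are arbitrary and absolute times will differ between systems):

```r
set.seed(1)
A <- matrix(rnorm(2000 * 400), 2000, 400)
B <- matrix(rnorm(2000 * 400), 2000, 400)

# Both expressions compute A'B; crossprod() avoids the explicit
# transpose (and the extra copy of A that t() creates).
t1 <- system.time(p1 <- t(A) %*% B)
t2 <- system.time(p2 <- crossprod(A, B))

all.equal(p1, p2)                    # same product
c(explicit = t1["elapsed"], crossprod = t2["elapsed"])
```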

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When

vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone, it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
       diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status)~trt, data=veteran),
       xlab="Years since randomisation", xscale=365,
       ylab="% surviving", yscale=100,
       col=c("forestgreen","blue"))
> survdiff(Surv(time, status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (x-axis: years since randomisation; y-axis: % surviving; legend: Standard, New).

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    h(t; z) = h0(t) e^{βz},  i.e.  log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status)~edtrt+
       log(bili)+log(protime)+age+platelet,
       data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age +
        platelet, data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
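In this output the exp(coef) column is just the exponential of the coef column: each entry is the multiplicative effect on the hazard of a one-unit increase in that predictor. A quick check against the printed values:

```r
## Hazard ratios are exp(coefficients); the values below
## reproduce (to rounding) the exp(coef) column above.
coefs <- c(edtrt = 1.02980, log.bili = 0.95100,
           log.protime = 2.88544, age = 0.03544,
           platelet = -0.00128)
exp(coefs)   # approx. 2.80, 2.59, 17.91, 1.04, 1.00
```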

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status)~edtrt,
       data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
       ratetable(edtrt=edtrt, bili=bili,
         platelet=platelet, age=age,
         protime=protime),
       data=pbc, subset=trt==-9,
       ratetable=mayomodel, cohort=TRUE),
       col="purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The


cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards (y-axis: Beta(t) for edtrt; x-axis: time).

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004

The R User Conference

by John Fox, McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces and from finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk

Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
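The internal representations described above can be inspected directly with unclass; a small sketch for the two base-R classes (chron is omitted here since it is a contributed package):

```r
## Date stores days since 1970-01-01; POSIXct stores seconds
## since 1970-01-01 00:00:00 GMT.
d <- as.Date("1970-01-02")
unclass(d)                                  # 1

p <- as.POSIXct("1970-01-02", tz = "GMT")
as.numeric(p)                               # 86400 = 24*60*60
```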

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
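One concrete payoff of this strategy: Date arithmetic cannot be thrown off by daylight savings changes, because there are no times or zones involved. A small illustration:

```r
## Adding 1 to a Date always advances exactly one calendar
## day, in any locale or time zone.
d <- as.Date("2004-04-04")
format(d + 1)                      # "2004-04-05"
as.numeric(d + 1) - as.numeric(d)  # exactly 1
```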

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
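A sketch of the Windows/Excel conversion just described (the serial numbers are made-up values for illustration):

```r
## Excel-on-Windows serial datetimes count days (and fractions
## of a day) since 1899-12-30; drop the fraction for Date.
x <- c(1, 2.5)                    # hypothetical Excel serial values
as.Date("1899-12-30") + floor(x)  # "1899-12-31" "1900-01-01"
```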

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
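The same origin-shifting idea handles SAS-style datetimes without Hmisc; a hedged sketch (the helper name sas2posix is made up here, not part of any package):

```r
## SAS datetimes are seconds since 1960-01-01; as.POSIXct's
## origin= argument applies the shift directly.
sas2posix <- function(x)
  as.POSIXct(x, origin = "1960-01-01", tz = "GMT")

format(sas2posix(0), tz = "GMT")      # "1960-01-01"
format(sas2posix(86400), tz = "GMT")  # "1960-01-02"
```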

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or timedate package to represent datetimes.
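The ts labelling scheme mentioned above can be seen with base R alone (a two-year monthly series is assumed for illustration):

```r
## ts labels observations by (year, cycle) pairs rather than
## by date objects: 24 monthly values starting January 2000.
tt <- ts(1:24, start = c(2000, 1), frequency = 12)
frequency(tt)  # 12
start(tt)      # 2000 1
end(tt)        # 2001 12
```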

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, the day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
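The "convert to character first" advice in the last item can be sketched as follows: taking a Date to POSIXct through format() pins the result to the intended calendar day in the current zone:

```r
## as.POSIXct(format(dt)) parses the date string as local
## midnight, so the printed day matches the original Date.
dt <- as.Date("2004-06-15")
p  <- as.POSIXct(format(dt))
format(p, "%Y-%m-%d")   # "2004-06-15"
```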

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison table. In this table, z is a vector of characters and x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. Expressions involving chron dates assume all chron options are set at their default values. See ?strptime for the % codes.

now
    Date:    Sys.Date()
    chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
    POSIXct: Sys.time()

origin
    Date:    structure(0, class="Date")
    chron:   chron(0)
    POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT days)
    Date:    structure(floor(x), class="Date")
    chron:   chron(x)
    POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
    Date:    dt1-dt2
    chron:   dc1-dc2
    POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
    POSIXct: dp-as.POSIXct(format(dp), tz="GMT")

compare
    Date:    dt1 > dt2
    chron:   dc1 > dc2
    POSIXct: dp1 > dp2

next day
    Date:    dt+1
    chron:   dc+1
    POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
    Date:    dt-1
    chron:   dc-1
    POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
    Date:    dt0+floor(x)
    chron:   dc0+x
    POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
    Date:    dts <- seq(dt0, length=10, by="day")
    chron:   dcs <- seq(dc0, length=10)
    POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
    Date:    plot(dts, rnorm(10))
    chron:   plot(dcs, rnorm(10))
    POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
    Date:    seq(dt0, length=3, by="2 week")
    chron:   dc0+seq(0, length=3, by=14)
    POSIXct: seq(dp0, length=3, by="2 week")

first day of month
    Date:    as.Date(format(dt, "%Y-%m-01"))
    chron:   chron(dc)-month.day.year(dc)$day+1
    POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:    as.numeric(format(dt, "%m"))
    chron:   months(dc)
    POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
    Date:    as.numeric(format(dt, "%w"))
    chron:   as.numeric(dates(dc)-3) %% 7
    POSIXct: as.numeric(format(dp, "%w"))

day of year
    Date:    as.numeric(format(dt, "%j"))
    chron:   as.numeric(format(as.Date(dc), "%j"))
    POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
    Date:    tapply(x, format(dt, "%Y-%m"), mean)
    chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:    tapply(x, format(dt, "%m"), mean)
    chron:   tapply(x, months(dc), mean)
    POSIXct: tapply(x, format(dp, "%m"), mean)

Output, yyyy-mm-dd
    Date:    format(dt)
    chron:   format(as.Date(dates(dc)))
    POSIXct: format(dp, "%Y-%m-%d")

Output, mm/dd/yy
    Date:    format(dt, "%m/%d/%y")
    chron:   format(dates(dc))
    POSIXct: format(dp, "%m/%d/%y")

Output, Sat Jul 1 1970
    Date:    format(dt, "%a %b %d %Y")
    chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct: format(dp, "%a %b %d %Y")

Input, "1970-10-15"
    Date:    as.Date(z)
    chron:   chron(z, format="y-m-d")
    POSIXct: as.POSIXct(z)

Input, "10/15/1970"
    Date:    as.Date(z, "%m/%d/%Y")
    chron:   chron(z)
    POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
    to Date:    as.Date(dates(dc)); as.Date(format(dp))
    to chron:   chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
    to POSIXct: as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
    to Date:    as.Date(dates(dc)); as.Date(dp)
    to chron:   chron(unclass(dt)); as.chron(dp)
    to POSIXct: as.POSIXct(format(dt), tz="GMT"); as.POSIXct(dc)

Notes:
[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche

A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 − specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
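To make the monotonicity argument concrete, here is a small check on invented data (the vectors T and D below are hypothetical, not from the article): the table-based rates agree with the definition applied at each cutpoint, and both are non-decreasing.

```r
# Hypothetical data: test results T and disease indicator D
T <- c(1, 2, 2, 3, 4, 5, 5, 6, 7, 9)
D <- c(0, 0, 1, 0, 1, 0, 1, 1, 1, 1)

# Table-based computation, as in the fast drawROC
DD <- table(-T, D)
sens <- cumsum(DD[,2])/sum(DD[,2])
mspec <- cumsum(DD[,1])/sum(DD[,1])

# Both rates are cumulative sums of non-negative counts, so monotone
stopifnot(all(diff(sens) >= 0), all(diff(mspec) >= 0))

# Direct computation from the definition at each cutpoint agrees
cutpoints <- rev(sort(unique(T)))
sens.def <- sapply(cutpoints, function(c) sum(D[T >= c])/sum(D))
mspec.def <- sapply(cutpoints, function(c) sum((1-D)[T >= c])/sum(1-D))
stopifnot(all.equal(as.vector(sens), sens.def),
          all.equal(as.vector(mspec), mspec.def))
```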

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if (null.line)
    abline(0, 1, lty=3)
  if (is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
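One possible solution, for readers who want to check their own: a trapezoidal-rule sketch over the (mspec, sens) points stored in the S3 object. The function name AUC, the padding of the curve to its (0,0) and (1,1) endpoints, and the example data are my own choices, not the article's.

```r
# A possible AUC (not the article's): trapezoidal rule over the
# (mspec, sens) points stored in an S3 "ROC" object
AUC <- function(x) {
  # pad with the curve's endpoints so the area covers [0,1]
  fpr <- c(0, x$mspec, 1)   # false positive rate (1 - specificity)
  tpr <- c(0, x$sens, 1)    # true positive rate (sensitivity)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1))/2)
}

# invented example with perfect separation, where the area should be 1
T <- c(1, 2, 3, 10, 11, 12)
D <- c(0, 0, 0, 1, 1, 1)
DD <- table(-T, D)
rval <- list(sens = cumsum(DD[,2])/sum(DD[,2]),
             mspec = cumsum(DD[,1])/sum(DD[,1]))
class(rval) <- "ROC"
stopifnot(all.equal(AUC(rval), 1))
```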

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
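To see a validity check in action, here is a sketch with a throwaway class name ("ROC2", invented here to avoid clashing with the article's class): new() succeeds when the slot lengths agree and signals an error when they do not. (Returning a character string from the validity function, rather than FALSE, gives a more informative error message.)

```r
library(methods)

# Throwaway class with the same validity idea as the article's "ROC"
setClass("ROC2",
  representation(sens="numeric", mspec="numeric"),
  validity=function(object) {
    if (length(object@sens) == length(object@mspec)) TRUE
    else "sens and mspec must have the same length"
  })

ok  <- new("ROC2", sens=c(0, 1), mspec=c(0, 1))          # passes the check
bad <- try(new("ROC2", sens=c(0, 1), mspec=0), silent=TRUE)  # fails it
stopifnot(is(ok, "ROC2"), inherits(bad, "try-error"))
```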

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (null.line)
      abline(0, 1, lty=3)
    if (is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument 'pt.cex'.

• plot.ts() has more arguments, particularly 'yax.flip'.

• heatmap() has a new 'keep.dendro' argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4 which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
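A short illustration (the names are invented):

```r
e <- new.env()
e$x <- 1           # same effect as assign("x", 1, envir=e)
e[["y"]] <- 2
stopifnot(e$x == 1, e[["y"]] == 2)

# no partial matching, unlike lists: e$xl does not find e$xlong
e$xlong <- 10
stopifnot(is.null(e$xl))
```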

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.
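A quick illustration (dates chosen arbitrarily):

```r
d1 <- as.Date("2004-06-01")
d2 <- as.Date("2004-06-15")
stopifnot(d2 - d1 == 14)                            # difference in days
stopifnot(format(d1 + 30, "%Y-%m-%d") == "2004-07-01")  # day arithmetic
```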

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
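A quick illustration (the names are invented):

```r
e <- new.env()
assign("a", 1, envir=e)
assign("b", 2, envir=e)
vals <- mget(c("a", "b"), envir=e)   # named list of both values at once
stopifnot(identical(vals, list(a=1, b=2)))
```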

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments and split="" is optimized.

• subset() now allows a 'drop' argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and if available used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

- Renamed push/pop.viewport() to push/popViewport().

- Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

- Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

- Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

- Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

  NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

- Added a 'just' argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

- Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


- Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

- Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

  grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

  grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

  the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

  In addition, the following features are now possible with grobs:

  grobs now save()/load() like any normal R object.

  many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

  there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

  there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

  there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

- Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

- Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping

  Still accepts logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer

RUnit R functions implementing a standard unit testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NNs. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward way to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631


Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the poMatrix class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(Mat.sol <- solve(crossprod(mmg),
+            crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE

Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object, so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(sparse.sol <- solve(crossprod(mm),
+            crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE

The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00

Epilogue

Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A'B are generally faster than those in AB.

Notice that this means that calculating A'B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices: yet another reason to use crossprod.
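The effect is easy to see directly. The following is a small, self-contained timing sketch (not from the article; the matrix sizes and variable names are arbitrary) comparing the explicit-transpose form with crossprod:

```r
## Illustrative comparison: both expressions compute A'B, but
## crossprod() avoids forming t(A) explicitly.
set.seed(1)
A <- matrix(rnorm(2000 * 200), 2000, 200)
B <- matrix(rnorm(2000 * 100), 2000, 100)

t.naive <- system.time(p1 <- t(A) %*% B)
t.cross <- system.time(p2 <- crossprod(A, B))

stopifnot(all.equal(p1, p2))  # the two results agree
rbind(t.naive, t.cross)       # crossprod is typically the faster of the two
```

The exact timings depend on the BLAS in use, but the two expressions always return the same matrix.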

Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D. M. Bates
U. of Wisconsin-Madison
[email protected]

Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
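As a quick orientation, a session might look as follows (a hypothetical sketch, assuming tkWidgets, widgetTools, and a working Tcl/Tk installation; these calls open interactive windows, so they are only useful in an interactive session):

```r
library(widgetTools)   # helper widgets used by tkWidgets
library(tkWidgets)

pExplorer("base")      # browse the files of an installed package
eExplorer("base")      # view and run example code from its help files
vExplorer()            # list all installed packages that have vignettes
```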

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed throughthe interface

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.


Figure 1: A snapshot of the widget when pExplorer is called.


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution window.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at that time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  # time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

# time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30 + time,
                     status))
[1] ( 210, 282 ]  ( 150, 561 ]
[3] (  90, 318 ]  ( 270, 396 ]
[9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (curves labelled "Standard" and "New"; x-axis "Years since randomisation", y-axis "% surviving").

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at that time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt + ratetable(edtrt = edtrt,
                  bili = bili, platelet = platelet,
                  age = age, protime = protime),
                data = pbc, subset = trt == -9,
                ratetable = mayomodel, cohort = TRUE),
        col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The


cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt against time).

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
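As an illustration of these two terms, here is a sketch not taken from the article; it assumes the bladder dataset shipped with the survival package, with its id, enum, rx, stop and event variables:

```r
library(survival)
data(bladder)
# Recurrent bladder-cancer events: several records per subject.
# cluster(id) requests robust standard errors that allow for
# correlation within a subject; strata(enum) gives each event
# number its own unspecified baseline hazard.
fit <- coxph(Surv(stop, event) ~ rx + cluster(id) + strata(enum),
             data = bladder)
summary(fit)
```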

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
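For comparison with the Cox fits above, a minimal survreg sketch (not from the article) on the veteran data might look like this:

```r
library(survival)
data(veteran)
# Weibull accelerated-failure-time model for survival time;
# dist = "lognormal" or "exponential" are among the alternatives.
wfit <- survreg(Surv(time, status) ~ trt + karno,
                data = veteran, dist = "weibull")
summary(wfit)
```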

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox, McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis and David Meyer served as chairs of the conference, program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of the other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip in the afternoon of May 22 to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

John Fox
McMaster University, Canada
jfox@mcmaster.ca

R Help Desk

Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of


the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
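The unclass(x) inspection just mentioned can be tried directly; the values shown follow from the stated origins:

```r
# Date: days since 1970-01-01
unclass(as.Date("1970-01-03"))                 # 2
# POSIXct: seconds since 1970-01-01 00:00:00 GMT
unclass(as.POSIXct("1970-01-02", tz = "GMT"))  # 86400
# POSIXlt: a list of components (sec, min, hour, mday, mon, ...)
unclass(as.POSIXlt(Sys.time()))
```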

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or date/time class to represent datetimes.
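A brief sketch of the regular and irregular cases (not from the article; it assumes the zoo package is installed):

```r
# Regular monthly series: ts labels points by period, not by date
reg <- ts(rnorm(24), start = c(2003, 1), frequency = 12)
# Irregular series: zoo indexed here by Date class dates
library(zoo)
irr <- zoo(rnorm(3),
           as.Date(c("2004-01-05", "2004-01-07", "2004-02-01")))
```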

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x

Regarding the POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than

POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
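The tz= check recommended in the first bullet can be carried out along these lines:

```r
# If a function respects tz=, these two calls should differ
# (unless the local time zone is itself GMT)
x <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")
format(x, tz = "GMT")   # "2004-06-01 12:00:00"
format(x)               # same instant, rendered in the local zone
```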

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


Idioms in the Date, chron and POSIXct classes:

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT days)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")   [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz = "GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]   [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]   [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]   [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))   [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

to chron
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))   [6]

to POSIXct
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

to chron
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

to POSIXct
  Date:    as.POSIXct(format(dt), tz = "GMT")
  chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison Table


Programmers' Niche

A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1 - Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)


  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
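One possible solution (my sketch, not the author's) applies the trapezoidal rule to the coordinates stored in the object:

```r
# Area under the curve for an object of class "ROC" as defined above
AUC <- function(x) {
  n <- length(x$sens)
  # trapezoidal rule over the (mspec, sens) polyline
  sum(diff(x$mspec) * (x$sens[-1] + x$sens[-n]) / 2)
}
```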

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new


function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
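Because new() runs the validity method, the extra checks suggested above (for example, rates within [0, 1]) can be folded into the validity function. A minimal self-contained sketch, not from the original article; the range check is illustrative:

```r
library(methods)

## Hypothetical extension of the validity check: besides matching
## lengths, require the sensitivity values to lie in [0, 1].
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    if (length(object@sens) != length(object@mspec) ||
        length(object@sens) != length(object@test))
      return("sens, mspec and test must have equal lengths")
    if (any(object@sens < 0 | object@sens > 1))
      return("sens must lie between 0 and 1")
    TRUE
  })

## new() now rejects inconsistent objects:
bad <- try(new("ROC", sens = c(0.5, 2), mspec = c(0.1, 0.2),
               test = c(1, 0), call = quote(ROC())), silent = TRUE)
inherits(bad, "try-error")   # TRUE: the validity failure is an error
```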

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
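To see the dispatch in action, here is a small sketch that repeats the class and method definitions from the text so it runs on its own:

```r
library(methods)

## Repeat the definitions from the text so this snippet is self-contained.
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"))
setMethod("show", "ROC", function(object) {
  cat("ROC curve: ")
  print(object@call)
})

r <- new("ROC", sens = c(0, 0.5, 1), mspec = c(0, 0.25, 1),
         test = c(2, 1, 0), call = quote(ROC(T, D)))
r         # auto-printing dispatches to show: "ROC curve: ROC(T, D)"
r@sens    # slot access with @, intended for use inside methods
```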

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a "..." argument for further graphical parameters.

setMethod("plot", c("ROC","missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
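The coexistence just described can be sketched with a hypothetical S3 class (not from the article): after lines() has been promoted to an S4 generic, an ordinary S3 method is still found, because the old UseMethod() function remains the default method.

```r
library(methods)

## Promote lines() to an S4 generic by registering a method,
## mirroring the setMethod("lines", "ROC") call in the text.
setClass("ROC", representation(sens = "numeric", mspec = "numeric"))
setMethod("lines", "ROC",
          function(x, ...) lines(x@mspec, x@sens, ...))

## An S3 method for a hypothetical class is still dispatched,
## via the default method (the original UseMethod-based function).
lines.myclass <- function(x, ...) "S3 method still reached"
obj <- structure(list(), class = "myclass")
lines(obj)   # "S3 method still reached"
```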

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm","missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")


    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far
               from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model is a linear model, more that it has a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
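The contrast between inheritance and delegation can be sketched with two hypothetical classes (illustrative only; this is not the real lm/glm implementation):

```r
library(methods)

## Inheritance: a weighted fit IS a (special case of) "lmResult".
setClass("lmResult", representation(coef = "numeric"))
setClass("weightedLmResult", contains = "lmResult",
         representation(weights = "numeric"))

## Delegation: a glm-like result HAS an associated linear model
## (say, from the last IRLS iteration), stored in a slot.
setClass("glmResult",
         representation(coef = "numeric", lastLm = "lmResult"))

w <- new("weightedLmResult", coef = c(1, 2), weights = c(1, 1))
is(w, "lmResult")   # TRUE: methods for lmResult apply to w

g <- new("glmResult", coef = 0.5,
         lastLm = new("lmResult", coef = 1))
is(g, "lmResult")   # FALSE: the linear model is reached via g@lastLm
```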

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument 'pt.cex'.

• plot.ts() has more arguments, particularly 'yax.flip'.

• heatmap() has a new 'keep.dendro' argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
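For example (behaviour as of 1.9.0):

```r
## make.names() now leaves underscores alone; other invalid
## characters are still replaced with dots.
make.names(c("my_var", "x y"))   # "my_var" "x.y"

## Underscore is a valid character in a name, so this assignment works:
my_var <- 1
my_var + 1                        # 2
```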

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
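For example:

```r
## The new accessors behave like get()/assign() with inherits=FALSE.
e <- new.env()
e$x <- 42                      # same as assign("x", 42, envir = e)
e$x                            # 42
e[["x"]]                       # 42; only character indices, no partial matching
exists("x", envir = e, inherits = FALSE)   # TRUE
```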

• There are now print() and '[' methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() paralleling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The 'package' argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument 'package' which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
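For example:

```r
## factorial() and lfactorial() are thin wrappers around the gamma
## functions, so they also accept non-integer arguments.
factorial(5)                                    # 120
all.equal(factorial(5), gamma(6))               # TRUE
all.equal(lfactorial(10), log(factorial(10)))   # TRUE
```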

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments 'etastart' and 'mustart' are now evaluated via the model frame in the same way as 'subset' and 'weights'.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
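For example:

```r
## mget() is the multi-name analogue of get().
e <- new.env()
e$a <- 1
e$b <- 2
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```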

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a 'model' argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both 'x' and 'covmat' are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both 'each' and 'length.out' have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both 'times' and 'length.out' have been specified, 'times' is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
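The three rep() compatibility changes above can be seen directly:

```r
## 0-length input with length.out pads with NAs instead of
## returning a 0-length result.
rep(integer(0), length.out = 3)       # NA NA NA

## 'each' combined with 'length.out' recycles ...
rep(1:2, each = 2, length.out = 6)    # 1 1 2 2 1 1

## ... and 'times' is ignored when 'length.out' is given.
rep(1:3, times = 2, length.out = 4)   # 1 2 3 1
```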

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
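For example:

```r
## Wrap a string so a Unix shell treats it as a single word.
shQuote("a file name.txt")                 # "'a file name.txt'"
cat(shQuote("a file name.txt"), "\n")      # prints: 'a file name.txt'

## Windows cmd-style quoting is also available:
shQuote("a file name.txt", type = "cmd")
```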

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments and split="" is optimized.

• subset() now allows a 'drop' argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added 'just' argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list, and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport 'clip' arg are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.

R News ISSN 1609-3631

Vol. 4/1, June 2004

kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN’s. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book “S Poetry”, available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery’s MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don’t know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De’ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from ‘bugs.R’, http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L’Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), “A simple Algorithm for Designing Group Sequential Clinical Trials”, Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities [‘goodies’] from Seminar fuer Statistik, ETH Zurich; many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. By S original by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer’s Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



product. The crossprod function does not do any transposition of matrices – yet another reason to use crossprod.


D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu

Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman

Introduction

An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.

Description

Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.

pExplorer

pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. If nothing is provided by a user, the default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named Meta and latex excluded.

Figure 1 shows the widget invoked by typing pExplorer("base").

The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.

R News ISSN 1609-3631

Vol 41 June 2004 21

Figure 1: A snapshot of the widget when pExplorer is called

R News ISSN 1609-3631

Vol 41 June 2004 22

• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked


The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular ‘counting process’ formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a Surv object by the Surv() function. If the start variable is zero for everyone, it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411 228 126 118  10  82
 [8] 110  314 100+ 42   8 144  25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
       diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a survfit object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))

> survdiff(Surv(time, status) ~ trt, data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (Standard and New; % surviving against years since randomisation)

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + βz

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
       log(bili) + log(protime) + age + platelet,
       data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
       data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
         ratetable(edtrt = edtrt, bili = bili,
                   platelet = platelet, age = age,
                   protime = protime),
         data = pbc, subset = trt == -9,
         ratetable = mayomodel, cohort = TRUE),
       col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards (smoothed estimate of β(t) for edtrt)

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years, but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.
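As an illustrative sketch (not taken from the article), a stratified fit to the veteran data used above: strata(celltype) gives each cell type its own unspecified baseline hazard, while trt and karno get coefficients shared across strata.

```r
## Sketch: stratified Cox model on the veteran lung cancer data.
## strata(celltype) contributes no regression coefficients; it only
## splits the baseline hazard by cell type.
library(survival)
data(veteran)
sfit <- coxph(Surv(time, status) ~ trt + karno + strata(celltype),
              data = veteran)
sfit
```

A cluster(id) term would be used analogously when several records belong to the same individual, giving robust (sandwich) standard errors rather than separate baselines.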

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
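For example (a sketch, not from the article), a Weibull accelerated-failure-time model fitted to the same veteran data provides a parametric counterpart to the Cox fit above:

```r
## Sketch: parametric (Weibull) survival regression on the veteran data.
## survreg models the log of survival time linearly in the predictors.
library(survival)
data(veteran)
wfit <- survreg(Surv(time, status) ~ trt, data = veteran,
                dist = "weibull")
summary(wfit)
```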

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population, and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users’ conference — useR! 2004 — which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations — from bioinformatics to user interfaces, and finance to fights among lizards — reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills, and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference — the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
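For example, assuming the usual Windows origin, the conversion can be sketched as (the serial numbers here are illustrative):

```r
x <- c(367, 36526)                # hypothetical Excel serial numbers
as.Date("1899-12-30") + floor(x)  # "1901-01-01" "2000-01-01"
```

The second value reflects the well-known fact that Excel serial number 36526 corresponds to January 1, 2000.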

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.[1]

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
    as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))
# if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE, then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison Table

Variables beginning with dt are Date variables, dc are chron variables, and dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. z is a vector of characters; x is a vector of numbers. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

now
    Date:    Sys.Date()
    chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
    POSIXct: Sys.time()

origin
    Date:    structure(0, class="Date")
    chron:   chron(0)
    POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
    Date:    structure(floor(x), class="Date")
    chron:   chron(x)
    POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
    Date:    dt1-dt2
    chron:   dc1-dc2
    POSIXct: difftime(dp1, dp2, unit="day")  [1]

time zone difference
    POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
    Date:    dt1 > dt2
    chron:   dc1 > dc2
    POSIXct: dp1 > dp2

next day
    Date:    dt+1
    chron:   dc+1
    POSIXct: seq(dp0, length=2, by="DSTday")[2]  [2]

previous day
    Date:    dt-1
    chron:   dc-1
    POSIXct: seq(dp0, length=2, by="-1 DSTday")[2]  [3]

x days since date
    Date:    dt0+floor(x)
    chron:   dc0+x
    POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2]  [4]

sequence
    Date:    dts <- seq(dt0, length=10, by="day")
    chron:   dcs <- seq(dc0, length=10)
    POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
    Date:    plot(dts, rnorm(10))
    chron:   plot(dcs, rnorm(10))
    POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10))  [5]

every 2nd week
    Date:    seq(dt0, length=3, by="2 week")
    chron:   dc0+seq(0, length=3, by=14)
    POSIXct: seq(dp0, length=3, by="2 week")

first day of month
    Date:    as.Date(format(dt, "%Y-%m-01"))
    chron:   chron(dc)-month.day.year(dc)$day+1
    POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:    as.numeric(format(dt, "%m"))
    chron:   months(dc)
    POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
    Date:    as.numeric(format(dt, "%w"))
    chron:   as.numeric(dates(dc)-3) %% 7
    POSIXct: as.numeric(format(dp, "%w"))

day of year
    Date:    as.numeric(format(dt, "%j"))
    chron:   as.numeric(format(as.Date(dc), "%j"))
    POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
    Date:    tapply(x, format(dt, "%Y-%m"), mean)
    chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:    tapply(x, format(dt, "%m"), mean)
    chron:   tapply(x, months(dc), mean)
    POSIXct: tapply(x, format(dp, "%m"), mean)

Output (yyyy-mm-dd)
    Date:    format(dt)
    chron:   format(as.Date(dates(dc)))
    POSIXct: format(dp, "%Y-%m-%d")

Output (mm/dd/yy)
    Date:    format(dt, "%m/%d/%y")
    chron:   format(dates(dc))
    POSIXct: format(dp, "%m/%d/%y")

Output (Sat Jul 1 1970)
    Date:    format(dt, "%a %b %d %Y")
    chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct: format(dp, "%a %b %d %Y")

Input ("1970-10-15")
    Date:    as.Date(z)
    chron:   chron(z, format="y-m-d")
    POSIXct: as.POSIXct(z)

Input ("10/15/1970")
    Date:    as.Date(z, "%m/%d/%Y")
    chron:   chron(z)
    POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
    to Date:    as.Date(dates(dc)); as.Date(format(dp))
    to chron:   chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
    to POSIXct: as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
    to Date:    as.Date(dates(dc)); as.Date(dp)
    to chron:   chron(unclass(dt)); as.chron(dp)
    to POSIXct: as.POSIXct(format(dt), tz="GMT"); as.POSIXct(dc)

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if one hour deviations at time changes between standard and daylight savings time are acceptable.
[3] Can be reduced to dp-24*60*60 if one hour deviations at time changes between standard and daylight savings time are acceptable.
[4] Can be reduced to dp+24*60*60*x if one hour deviations are acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
    cutpoints <- c(-Inf, sort(unique(T)), Inf)
    sens <- sapply(cutpoints,
        function(c) sum(D[T > c])/sum(D))
    spec <- sapply(cutpoints,
        function(c) sum((1-D)[T <= c])/sum(1-D))
    plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
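As a quick check, the function can be exercised on simulated data (a usage sketch, not from the article), where the test result is shifted upwards in the diseased group:

```r
set.seed(42)
D <- rep(c(0, 1), each = 100)  # disease indicator for 200 subjects
T <- rnorm(200, mean = 2 * D)  # test results, higher among cases
drawROC(T, D)                  # curve should bow well above the diagonal
```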

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens=sens, mspec=mspec, test=TT,
        call=sys.call())
    class(rval) <- "ROC"
    rval
}

print.ROC <- function(x, ...) {
    cat("ROC curve: ")
    print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
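A brief usage sketch (with hypothetical data T and D as before): constructing the object and letting print dispatch to the new method:

```r
r <- ROC(T, D)  # T, D as in the earlier example
class(r)        # "ROC"
r               # print() dispatches to print.ROC: ROC curve: ROC(T, D)
```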

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
        xlab="1-Specificity", ylab="Sensitivity",
        main=NULL, ...) {
    par(pty="s")
    plot(x$mspec, x$sens, type=type,
        xlab=xlab, ylab=ylab, ...)
    if (null.line)
        abline(0, 1, lty=3)
    if (is.null(main))
        main <- x$call
    title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ...,
        digits=1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
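One possible solution (a trapezoidal-rule sketch added here, not the author's answer) integrates sensitivity over the false positive rate, using the sens and mspec components of the S3 object:

```r
AUC <- function(x) {
    # x is an object of class "ROC"; sens and mspec both increase towards 1
    sum(diff(x$mspec) * (head(x$sens, -1) + tail(x$sens, -1)) / 2)
}
```

A value near 0.5 indicates no discrimination; values near 1 indicate a strongly discriminating test.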

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens=sens, mspec=mspec,
        test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
            c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously
far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens=sens, mspec=mspec,
        test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
    representation(sens="numeric", mspec="numeric",
        test="numeric", call="call"),
    validity=function(object) {
        length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)
    })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
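For instance (a hypothetical call, not from the article), an object whose slot lengths disagree is rejected at creation time, because new() runs the validity function:

```r
try(new("ROC", sens = c(0.5, 1), mspec = 0.5,
        test = c(2, 1, 0), call = sys.call()))
# the validity function returns FALSE, so new() signals an error
```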

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
    function(object) {
        cat("ROC curve: ")
        print(object@call)
    })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
    function(x, y, type="b", null.line=TRUE,
            xlab="1-Specificity",
            ylab="Sensitivity",
            main=NULL, ...) {
        par(pty="s")
        plot(x@mspec, x@sens, type=type,
            xlab=xlab, ylab=ylab, ...)
        if (null.line)
            abline(0, 1, lty=3)
        if (is.null(main))
            main <- x@call
        title(main=main)
    })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
    function(x, ...)
        lines(x@mspec, x@sens, ...)
    )

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call:

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T) {
        if (!(T$family$family %in%
                c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")
        test <- fitted(T)
        disease <- (test + resid(T, type="response"))
        disease <- disease*weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far
from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[, 2])/sum(DD[, 2])
        mspec <- cumsum(DD[, 1])/sum(DD[, 1])
        new("ROC", sens=sens, mspec=mspec,
            test=TT, call=sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
    function(x, labels=NULL, ...,
            digits=1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
            labels=labels, ...)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

bull New functions factorial() defined asgamma(x+1) and for S-PLUS compatibilitylfactorial() defined as lgamma(x+1)
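A minimal illustration of the definitions above (editor's sketch):

```r
## factorial(x) is gamma(x + 1); lfactorial(x) is lgamma(x + 1)
print(factorial(5))                              # 120
print(all.equal(lfactorial(10), log(factorial(10))))  # TRUE
```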

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.

R News ISSN 1609-3631

Vol. 4/1, June 2004

• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
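For example (editor's sketch), head() and tail() extract the first or last part of a vector (they also work on data frames and matrices):

```r
## first three letters and last two integers
print(head(letters, 3))  # "a" "b" "c"
print(tail(1:10, 2))     # 9 10
```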

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
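A small sketch of mget() (variable names here are arbitrary):

```r
## retrieve several bindings from an environment in one call
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
print(mget(c("a", "b"), envir = e))  # list(a = 1, b = 2)
```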

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
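The rep() compatibility changes above can be sketched as follows (editor's illustration, arbitrary inputs):

```r
## length-0 input with length.out now yields NAs, not a length-0 result
print(rep(integer(0), length.out = 3))    # NA NA NA

## each + length.out: result is recycled rather than padded with NAs
print(rep(1:2, each = 2, length.out = 6)) # 1 1 2 2 1 1

## times + length.out: times is ignored
print(rep(1:3, times = 2, length.out = 4))  # 1 2 3 1
```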

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
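The new strsplit() arguments in action (editor's sketch, arbitrary strings):

```r
## fixed = TRUE treats the split string literally (here a dot, not "any char")
print(strsplit("a.b.c", ".", fixed = TRUE)[[1]])     # "a" "b" "c"

## perl = TRUE uses Perl-compatible regular expressions
print(strsplit("a1b22c", "[0-9]+", perl = TRUE)[[1]])  # "a" "b" "c"
```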

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
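A small sketch of addmargins() on an arbitrary 2 x 2 table (by default it appends sums over all margins):

```r
## a toy table; addmargins() adds a "Sum" row and "Sum" column
tab <- as.table(matrix(1:4, 2,
                       dimnames = list(rows = c("r1", "r2"),
                                       cols = c("c1", "c2"))))
print(addmargins(tab))  # grand total in the bottom-right cell
```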

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit() and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)Javier de las Rivas (Spain)Sandrine Dudoit (USA)Werner Engl (Austria)Balazs Gyoumlrffy (Germany)Branimir K Hackenberger (Croatia)O Maurice Haynes (USA)Kai Hendry (UK)Arne Henningsen (Germany)Wolfgang Huber (Germany)Marjatta Kari (Finland)Osmo Kari (Finland)Peter Lauf (Germany)Andy Liaw (USA)Mu-Yen Lin (Taiwan)Jeffrey Lins (Danmark)James W MacDonald (USA)Marlene Muumlller (Germany)Gillian Raab (UK)Roland Rau (Germany)Michael G Schimek (Austria)R Woodrow Setzer (USA)Thomas Telkamp (Netherlands)Klaus Thul (Taiwan)Shusaku Tsumoto (Japan)Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/




Figure 1: A snapshot of the widget when pExplorer is called.



• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, which will be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of that package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution window.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.



Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").



Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().



Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.



The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a Surv object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a survfit object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation", xscale = 365,
+      ylab = "% surviving", yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928



Figure 1: Survival distributions for two lung cancer treatments.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t, z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t, z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+               ratetable(edtrt = edtrt, bili = bili,
+                         platelet = platelet, age = age,
+                         protime = protime),
+               data = pbc, subset = trt == -9,
+               ratetable = mayomodel, cohort = TRUE),
+       col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
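As a sketch of how these terms fit into a model formula (not an example from the article; the data and variable names here are simulated and hypothetical, but coxph(), Surv(), cluster() and strata() are the survival package functions just described):

```r
library(survival)

## simulated data (hypothetical names): two records per subject,
## with a centre variable defining separate baseline hazards
set.seed(1)
df <- data.frame(id     = rep(1:50, each = 2),
                 time   = rexp(100),
                 event  = rbinom(100, 1, 0.7),
                 x      = rnorm(100),
                 centre = rep(c("A", "B"), 50))

## cluster(id): robust variance for correlated records per subject;
## strata(centre): a separate unspecified baseline hazard per centre
fit <- coxph(Surv(time, event) ~ x + cluster(id) + strata(centre),
             data = df)
fit
```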

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
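For instance, a sketch of a parametric fit on the veteran data used earlier in the article (the choice of predictors and the Weibull distribution are illustrative, not from the article):

```r
library(survival)
data(veteran)

## accelerated failure time model for log survival time;
## karno is the Karnofsky performance score in the veteran data
wfit <- survreg(Surv(time, status) ~ trt + karno,
                data = veteran, dist = "weibull")
summary(wfit)
```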

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
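A minimal sketch of the person-years computation with the package's pyears() function, using simulated data with hypothetical variable names (the population-comparison step via a rate table is omitted here):

```r
library(survival)

## simulated follow-up data: time in days, a grouping variable
set.seed(2)
df <- data.frame(time   = rexp(100, 1/2000),
                 status = rbinom(100, 1, 0.5),
                 group  = rep(c("a", "b"), 50))

## person-years of follow-up per group, scaled from days to years
py <- pyears(Surv(time, status) ~ group, data = df, scale = 365.25)
py$pyears
```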

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada
[email protected]

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice



• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, databases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
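The internal representations just described can be checked directly with unclass() and as.numeric(); the following sketch assumes the chron package is installed (Date and POSIXct need only base R):

```r
library(chron)

d <- as.Date("1970-01-03")
unclass(d)      # days since January 1, 1970: here 2

dc <- chron("01/02/70", "12:00:00")
as.numeric(dc)  # days plus fraction of a day: noon on Jan 2nd is 1.5

dp <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")
unclass(dp)     # seconds since January 1, 1970, GMT: here 86400
```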

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word 'usually' was employed above.
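To illustrate the Windows-origin recipe (a sketch; 367 is the serial number Excel assigns to January 1, 1901, and adding 0.5 represents noon):

```r
## Excel (Windows) serial date numbers: days since 1899-12-30
x <- c(367, 367.5)

## only whole days survive the conversion to Date class
as.Date("1899-12-30") + floor(x)  # both map to 1901-01-01
```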

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
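For instance, a regular monthly series needs no date class at all (a sketch using only base R; the data are simulated):

```r
## two years of simulated monthly data, starting January 2002
z <- ts(rnorm(24), start = c(2002, 1), frequency = 12)

frequency(z)                                      # 12 for monthly data
window(z, start = c(2002, 7), end = c(2002, 9))   # extract a sub-series
```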

Input

By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data.



orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
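As an illustration of the character-first recipe, this sketch (not from the article) compares a direct POSIXct-to-Date conversion with one that goes through character; when the local time zone differs from GMT the two can disagree late in the day:

```r
## a datetime late in the day, in the local time zone
dp <- as.POSIXct("2004-06-01 23:00:00")

## direct conversion interprets dp relative to GMT
## and can therefore shift the date
as.Date(dp)

## converting to character first keeps the local date
as.Date(format(dp))
```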

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison table

Notes: z is a vector of characters and x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  (GMT)

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    from chron: as.Date(dates(dc));  from POSIXct: as.Date(format(dp))
  to chron:   from Date: chron(unclass(dt));  from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: from Date: as.POSIXct(format(dt));  from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    from chron: as.Date(dates(dc));  from POSIXct: as.Date(dp)
  to chron:   from Date: chron(unclass(dt));  from POSIXct: as.chron(dp)
  to POSIXct: from Date: as.POSIXct(format(dt), tz = "GMT");  from chron: as.POSIXct(dc)

[1] Can be reduced to dp1 - dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/times suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T == TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
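
A minimal sketch of this dispatch convention (the generic and class names are invented for illustration; they are not part of the article's code):

```r
# Hypothetical S3 generic and methods for a made-up "duck" class.
look <- function(x, ...) UseMethod("look")
look.duck <- function(x, ...) "looks like a duck"
look.default <- function(x, ...) "looks like something else"

d <- structure(list(), class = "duck")
look(d)    # dispatches to look.duck -> "looks like a duck"
look(1:3)  # falls back to look.default -> "looks like something else"
```

Nothing checks that a "duck" object can actually do any of these things; that contract is entirely the programmer's responsibility.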

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
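
One possible solution to the exercise (this is an assumed implementation, not the author's): apply the trapezoidal rule to the (mspec, sens) points stored in an "ROC" object, prepending the (0, 0) corner.

```r
# Hypothetical AUC sketch: trapezoidal area under the stored
# (false positive rate, true positive rate) points.
AUC <- function(x) {
  sens <- c(0, x$sens)
  mspec <- c(0, x$mspec)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}

# A test that separates the two groups perfectly has area 1:
r <- list(sens = c(0.5, 1, 1, 1), mspec = c(0, 0, 0.5, 1))
AUC(r)  # 1
```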

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
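
To make the slot-access notation concrete, here is a minimal sketch with a toy class (the class name "Pt" and its slots are invented for this example):

```r
library(methods)

# A tiny S4 class with two numeric slots.
setClass("Pt", representation(x = "numeric", y = "numeric"))

p <- new("Pt", x = 1, y = 2)
p@x  # 1  (slot access, analogous to $ for lists)
p@y  # 2
```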

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4 which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
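
A brief illustration of this feature as it works in current R sessions (a sketch, not from the original announcement):

```r
# $ and [[ on environments behave like get/assign with inherits=FALSE.
e <- new.env()
e$x <- 1:3          # assign via $<-
e[["y"]] <- "abc"   # assign via [[<-

e$x       # 1 2 3
e[["y"]]  # "abc"
```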

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
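
For example (a sketch run in a current R session):

```r
# factorial(x) is gamma(x+1); lfactorial(x) is its logarithm.
factorial(5)   # 120
lfactorial(5)  # log(120), i.e. about 4.787
```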

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.

• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
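
A quick sketch of what this looks like in practice:

```r
# mget() retrieves several named objects from an environment at once.
e <- new.env()
e$a <- 1
e$b <- 2
mget(c("a", "b"), envir = e)  # list(a = 1, b = 2)
```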

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
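
For instance (a sketch; the type argument selects the shell convention):

```r
# Quote strings for a Bourne-family shell.
shQuote("abc", type = "sh")    # "'abc'"
shQuote("don't", type = "sh")  # escapes the embedded single quote
```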

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
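
A quick illustration of the fixed argument (a sketch):

```r
# With fixed = TRUE the split string is taken literally rather than
# as a regular expression ("." would otherwise match any character).
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"
```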

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.

• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
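
A one-dimensional sketch of what addmargins() does:

```r
# addmargins() appends marginal summaries (by default, sums) to a table.
tab <- table(factor(c("a", "a", "b")))
addmargins(tab)  # counts for a and b, plus their Sum
```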

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

    current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

    See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A",
                gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"),
                gp=gpar(fill="blue"))

  - Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  - Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  - Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  - Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

      "on"      clip to the viewport (was TRUE)
      "inherit" clip to whatever parent says (was FALSE)
      "off"     turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.

R News ISSN 1609-3631

Vol 41 June 2004 41

bull Added an option to packagedependencies()to handle the rsquoSuggestsrsquo levels of dependencies

bull Vignette dependencies can now be checkedand obtained via vignetteDepends

bull Option repositories to list URLs for pack-age repositories added

bull packagedescription() has been replaced bypackageDescription()

bull R CMD INSTALLbuild now skip Subversionrsquossvn directories as well as CVS directories

bull arraySubscript and vectorSubscript take anew argument which is a function pointer thatprovides access to character strings (such as thenames vector) rather than assuming these arepassed in

bull R_CheckUserInterrupt is now described inlsquoWriting R Extensionsrsquo and there is a newequivalent subroutine rchkusr for calling fromFORTRAN code

bull hsv2rgb and rgb2hsv are newly in the C API

bull Salloc and Srealloc are provided in Sh aswrappers for S_alloc and S_realloc sincecurrent S versions use these forms

bull The type used for vector lengths is nowR_len_t rather than int to allow for a futurechange

bull The internal header nmathdpqh hasslightly improved macros R_DT_val() andR_DT_Cval() a new R_D_LExp() and improvedR_DT_log() and R_DT_Clog() this improvesaccuracy in several [dpq]-functions for extremearguments

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv = 2) and psigamma(*, deriv = 3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
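The psigamma() replacement mentioned in the deprecation notes above can be sanity-checked directly in base R. This small sketch (not from the newsletter) verifies the polygamma identities behind retiring tetragamma() and pentagamma():

```r
# psigamma(x, deriv = n) is the n-th derivative of the digamma
# function; deriv = 2 and deriv = 3 cover the deprecated
# tetragamma() and pentagamma().

# trigamma(1) equals the known constant pi^2/6 ...
stopifnot(all.equal(psigamma(1, deriv = 1), trigamma(1)),
          all.equal(trigamma(1), pi^2 / 6))

# ... and every polygamma order obeys the recurrence
# psigamma(x + 1, n) = psigamma(x, n) + (-1)^n * factorial(n) / x^(n + 1),
# checked here for the tetragamma case (n = 2) at x = 1:
stopifnot(all.equal(psigamma(2, deriv = 2),
                    psigamma(1, deriv = 2) + 2))
```

Code written for a version with tetragamma(x) can thus be updated mechanically to psigamma(x, deriv = 2).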

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang, and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 groups) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.

eExplorer

eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.

vExplorer

vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.

References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.


Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
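As a self-contained illustration of the two forms (made-up times, not the veteran data; assuming the survival package that ships with R is attached):

```r
library(survival)

# Right-censored form: stop time plus a 0/1 event indicator.
# Subject 2 is censored, so it prints with a "+": "10+".
s1 <- Surv(time = c(5, 10), event = c(1, 0))

# Counting-process form: (start, stop] intervals with an event
# indicator; subject 2 prints as an open-ended interval "(3,12+]".
s2 <- Surv(time = c(0, 3), time2 = c(9, 12), event = c(1, 0))

print(s1)
print(s2)
```

Internally a Surv object is a matrix with two columns (time, status) or three columns (start, stop, status), which is why the start column can simply be dropped when it is zero for everyone.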

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and from the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time,status)~trt,data=veteran),
     xlab="Years since randomisation",
     xscale=365, ylab="% surviving", yscale=100,
     col=c("forestgreen","blue"))
> survdiff(Surv(time,status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (x-axis: years since randomisation; y-axis: % surviving; curves labelled Standard and New).

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
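One consequence of the multiplicative form is that exp(β) is the hazard ratio for a unit increase in a covariate, which is what the exp(coef) column of a coxph() printout reports. A quick sketch of the arithmetic, using the edtrt coefficient β ≈ 1.0298 from the Mayo model fitted later in this article:

```r
# Under log h(t; z) = log h0(t) + beta * z, a unit increase in z
# multiplies the hazard by exp(beta) -- the exp(coef) column of
# a coxph() printout. For the edtrt coefficient (beta = 1.0298):
beta <- 1.0298
hazard.ratio <- exp(beta)
round(hazard.ratio, 3)  # close to the 2.800 reported for edtrt
```

So patients with edema have roughly 2.8 times the hazard of death of otherwise comparable patients without it.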

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                       log(bili) + log(protime) +
                       age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age +
        platelet, data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

               se(coef)     z       p
edtrt          0.300321  3.43 0.00061
log(bili)      0.097771  9.73 0.00000
log(protime)   1.031908  2.80 0.00520
age            0.008489  4.18 0.00003
platelet       0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt + ratetable(edtrt = edtrt,
                  bili = bili, platelet = platelet,
                  age = age, protime = protime),
                data = pbc, subset = trt == -9,
                ratetable = mayomodel, cohort = TRUE),
        col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

[Figure 2 here: observed and predicted survival curves; x-axis time in days (0 to 4000), y-axis survival probability (0.0 to 1.0).]

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure 3 here: an estimate of Beta(t) for edtrt plotted against time.]

Figure 3: Testing proportional hazards.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada
jfox@mcmaster.ca

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills, and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R 1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R 1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated, but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
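The internal representations described above can be checked directly with unclass or as.numeric. The following sketch assumes the chron package is installed; the specific dates are chosen only for illustration.

```r
# Compare internal representations one day (and, for chron, half a
# day) after the common 1970 origin of the three classes.
library(chron)

d  <- as.Date("1970-01-02")                          # days since 1970-01-01
dc <- chron("01/02/70", "12:00:00")                  # days + fraction of a day
dp <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")  # seconds since origin, GMT

as.numeric(d)   # 1
as.numeric(dc)  # 1.5, i.e. noon on January 2nd, 1970
as.numeric(dp)  # 86400
```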

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article, which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
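As a quick check of the Windows recipe (the serial numbers here are invented for illustration; 25569 is the well-known Excel serial number for January 1, 1970):

```r
# Hypothetical Excel-on-Windows serial datetimes: days and fractions
# of days since 1899-12-30.
x <- c(25569, 25570.5)
as.Date("1899-12-30") + floor(x)  # "1970-01-01" "1970-01-02"
```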

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or timedate package to represent datetimes.
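A minimal sketch of a regular series, with invented data, shows the year-plus-fraction labelling that ts uses in place of a date class:

```r
# A quarterly series starting in the first quarter of 2003.
z <- ts(rnorm(8), start = c(2003, 1), frequency = 4)
frequency(z)              # 4
as.numeric(time(z))[1:3]  # 2003.00 2003.25 2003.50
```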

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates, relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/

orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
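A small check of the time-zone advice above (the date is invented): supplying tz= explicitly when creating and formatting pins down the representation regardless of the local zone.

```r
# Noon GMT is 43200 seconds into the day, whatever the local zone is.
dp <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")
format(dp, "%H:%M", tz = "GMT")  # "12:00" in every local time zone
as.numeric(dp) %% 86400          # 43200
```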

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


now:
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin:
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin:
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  (GMT)

diff in days:
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference:
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare:
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day:
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day:
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date:
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence:
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot:
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week:
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month:
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts:
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0):
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year:
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year:
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means:
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output:

yyyy-mm-dd:
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy:
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970:
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input:

"1970-10-15":
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970":
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = ""):

to Date:
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

to chron:
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]

to POSIXct:
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT"):

to Date:
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

to chron:
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

to POSIXct:
  Date:    as.POSIXct(format(dt), tz = "GMT")
  chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison table


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
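A toy run of the tabulation (the data are made up) makes the monotonicity visible:

```r
# Two controls (D = 0) with low test values and two cases (D = 1)
# with high ones; tabulating against -T puts high cutpoints first.
T <- c(1, 2, 3, 4)
D <- c(0, 0, 1, 1)
DD <- table(-T, D)
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])  # 0.5 1.0 1.0 1.0
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])  # 0.0 0.0 0.5 1.0
all(diff(sens) >= 0) && all(diff(mspec) >= 0)  # TRUE
```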

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
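Dispatch can be demonstrated with a throwaway class; the class name "duck" and its method below are purely illustrative:

```r
# print() is generic: for an object of class "duck" it dispatches
# to print.duck if one exists.
print.duck <- function(x, ...) cat("quack\n")
obj <- structure(list(), class = "duck")
print(obj)  # quack
```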

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
         representation(sens = "numeric", mspec = "numeric",
                        test = "numeric", call = "call"),
         validity = function(object) {
           length(object@sens) == length(object@mspec) &&
             length(object@sens) == length(object@test)
         })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
          function(object) {
            cat("ROC curve: ")
            print(object@call)
          })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
          function(x, y, type = "b", null.line = TRUE,
                   xlab = "1-Specificity",
                   ylab = "Sensitivity",
                   main = NULL, ...) {
            par(pty = "s")
            plot(x@mspec, x@sens, type = type,
                 xlab = xlab, ylab = ylab, ...)
            if (null.line)
              abline(0, 1, lty = 3)
            if (is.null(main))
              main <- x@call
            title(main = main)
          })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
          function(x, ...) {
            lines(x@mspec, x@sens, ...)
          })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
          function(T) {
            if (!(T$family$family %in%
                  c("binomial", "quasibinomial")))
              stop("ROC curves for binomial glms only")
            test <- fitted(T)
            disease <- (test + resid(T, type = "response"))
            disease <- disease * weights(T)
            if (max(abs(disease %% 1)) > 0.01)
              warning("Y values suspiciously far from integers")
            TT <- rev(sort(unique(test)))
            DD <- table(-test, disease)
            sens <- cumsum(DD[, 2]) / sum(DD[, 2])
            mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
            new("ROC", sens = sens, mspec = mspec,
                test = TT, call = sys.call())
          })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, ..., labels = NULL,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
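The contrast can be sketched with two toy S4 classes (illustrative names, not from this article): inheritance says the new class is a special case of the old one, while delegation stores the simpler fit in a slot of a more general object.

```r
library(methods)

setClass("simpleFit", representation(coef = "numeric"))

# Inheritance: a specialFit *is* a simpleFit (a special case).
setClass("specialFit", contains = "simpleFit")

# Delegation: a generalFit *has* a simpleFit (e.g. a working model
# from the last iteration), plus extra structure of its own.
setClass("generalFit",
         representation(working = "simpleFit", extra = "numeric"))

sp <- new("specialFit", coef = c(1, 2))
g  <- new("generalFit",
          working = new("simpleFit", coef = c(1, 2)), extra = 0.1)

is(sp, "simpleFit")  # TRUE: inherits slots and methods
is(g,  "simpleFit")  # FALSE: merely contains one
g@working@coef       # reach the delegated object's slots
```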

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
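For example (variable names are illustrative):

```r
e <- new.env()
e$alpha <- 1          # $<- assigns into the environment
e[["beta"]] <- 2      # [[<- likewise; only character indices are allowed
e$alpha + e[["beta"]] # 3
is.null(e$alp)        # TRUE: no partial matching of "alpha"
```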

• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models (with a warning).

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
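A brief illustration (dates chosen arbitrarily):

```r
d <- as.Date("2004-06-01")  # construct a "Date" from a string
d + 30                      # arithmetic is in whole days: "2004-07-01"
weekdays(d)                 # utility functions parallel the date-time ones
format(d, "%Y-%m-%d")       # formatting as for date-times
```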

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
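For instance:

```r
factorial(5)     # 120, computed as gamma(5 + 1)
factorial(0)     # 1
lfactorial(100)  # log(100!) on the log scale, avoiding overflow
```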

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.

• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
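For example:

```r
head(letters, 3)  # first three elements: "a" "b" "c"
tail(letters, 2)  # last two: "y" "z"
head(mtcars, 2)   # also works row-wise on data frames
```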

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
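For example (names are illustrative):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)  # list(a = 1, b = 2) in one call
```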

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
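These compatibility changes can be seen directly:

```r
rep(numeric(0), length.out = 3)     # NA NA NA: length 3, not length 0
rep(1:2, each = 2, length.out = 5)  # 1 1 2 2 1: recycled, not NA-padded
rep(1:3, times = 2, length.out = 4) # 1 2 3 1: times is ignored
```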

• rgb2hsv() is new: an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
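For example:

```r
shQuote("a b", type = "sh")    # "'a b'": single quotes for Bourne shells
shQuote("don't", type = "sh")  # embedded single quotes are escaped safely
shQuote("a b", type = "cmd")   # double quotes for the Windows shell
```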

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
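For example:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c": a literal split
strsplit("a1b22c", "[0-9]+")[[1]]          # "a" "b" "c": a regexp split
strsplit("a1b", "[0-9]", perl = TRUE)[[1]] # same, via the PCRE engine
```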

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5,
                      name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5,
                      name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A",
          gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"),
          gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods in-cluding support vector machines By Alexan-dros Karatzoglou Alex Smola Achim ZeileisKurt Hornik

klaR Miscellaneous functions for classification andvisualization developed at the Department ofStatistics University of Dortmund By Chris-tian Roever Nils Raabe Karsten Luebke UweLigges

knncat This program scales categorical variables insuch a way as to make NN classification as ac-curate as possible It also handles continuousvariables and prior probabilities and does in-telligent variable selection and estimation of er-ror rates and the right number of NNrsquos By SamButtrey

kza Kolmogorov-Zurbenko Adpative filter for locat-ing change points in a time series By BrianClose with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models througha computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider2001) the procedure is of particular interest forhigh-dimensional data without missing valuessuch as geophysical fields By S M Barbosa

mathgraph Simple tools for constructing and ma-nipulating objects of class mathgraph fromthe book ldquoS Poetryrdquo available at httpwwwburns-statcompagesspoetryhtml ByOriginal S code by Patrick J Burns Ported toR by Nick Efthymiou

mcgibbsit Provides an implementation of Warnesamp Rafteryrsquos MCGibbsit run-length diagnos-tic for a set of (not-necessarily independent)MCMC sampers It combines the estimateerror-bounding approach of Raftery and Lewiswith evaluate between verses within chain ap-proach of Gelman and Rubin By Gregory RWarnes

mixreg Fits mixtures of one-variable regressions(which has been described as doing ANCOVAwhen you donrsquot know the levels) By RolfTurner

mscalib Calibration and filtering methods for cal-ibration of mass spectrometric peptide masslists Includes methods for internal externalcalibration of mass lists Provides methods forfiltering chemical noise and peptide contami-nants By Eryk Witold Wolski

multinomRob overdispersed multinomial regres-sion using robust (LQD and tanh) estimationBy Walter R Mebane Jr Jasjeet Singh Sekhon

mvbutils Utilities for project organization editingand backup sourcing documentation (formaland informal) package preparation macrofunctions and miscellaneous utilities Neededby debug package By Mark V Bravington

mvpart Multivariate regression trees Original rpartby Terry M Therneau and Beth Atkinson Rport by Brian Ripley Some routines from ve-gan by Jari Oksanen Extensions and adapta-tions of rpart to mvpart by Glenn Dersquoath

pgam This work is aimed at extending a class ofstate space models for Poisson count dataso called Poisson-Gamma models towards asemiparametric specification Just like the gen-eralized additive models (GAM) cubic splinesare used for covariate smoothing The semi-parametric models are fitted by an iterativeprocess that combines maximization of likeli-hood and backfitting algorithm By Washing-ton Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
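As an illustrative sketch of the package's central function (the measurements below are simulated, not from a real process), a Shewhart X-bar chart for rational subgroups of five observations can be drawn with qcc():

```r
## Simulated data: 20 samples of 5 measurements each,
## one row per rational subgroup.
library(qcc)
set.seed(1)
x <- matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 5)
q <- qcc(x, type = "xbar")  # Shewhart X-bar chart
summary(q)
```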

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler

R News ISSN 1609-3631


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
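A brief sketch of the idea (the data are simulated for illustration): sandwich() returns a heteroskedasticity-consistent covariance matrix for a fitted model, which can be compared with the usual model-based one.

```r
## Simulated regression with heteroskedastic errors.
library(sandwich)
set.seed(42)
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100, sd = abs(x))
fit <- lm(y ~ x)
sqrt(diag(vcov(fit)))      # usual standard errors
sqrt(diag(sandwich(fit)))  # model-robust standard errors
```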

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
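A minimal sketch of the class (the dates and values are chosen arbitrarily for illustration):

```r
## An irregular (unequally spaced) daily series in zoo.
library(zoo)
days <- as.Date("2004-06-01") + c(0, 1, 4, 9)
z <- zoo(c(1.2, 1.5, 1.1, 1.8), order.by = days)
z                     # printed with its date index
merge(z, lag(z, -1))  # align the series with its lag
```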

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: [email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions go to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/



Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").


Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a Surv object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a survfit object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status)~trt, data=veteran),
       xlab="Years since randomisation", xscale=365,
       ylab="% surviving", yscale=100,
       col=c("forestgreen","blue"))
> survdiff(Surv(time, status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments ("Standard" and "New"; percent surviving against years since randomisation).

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status)~edtrt+
+         log(bili)+log(protime)+age+platelet,
+         data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status)~edtrt,
+         data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
+         ratetable(edtrt=edtrt, bili=bili,
+                   platelet=platelet, age=age,
+                   protime=protime),
+         data=pbc, subset=trt==-9,
+         ratetable=mayomodel, cohort=TRUE),
+       col="purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards. The plot shows the smoothed estimate of Beta(t) for edtrt against time.

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
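For instance, a parametric Weibull fit to the veteran data used earlier might look like this (a sketch; the choice of covariates is for illustration only):

```r
## Accelerated-failure-time alternative to the Cox model.
library(survival)
data(veteran)
wfit <- survreg(Surv(time, status) ~ trt + celltype,
                data = veteran, dist = "weibull")
summary(wfit)
```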

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox, McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk

Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
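The progression can be illustrated by representing the same (arbitrarily chosen) date in each system; the chron line assumes the chron package is installed:

```r
as.Date("2004-06-15")              # Date: no times, no time zones
library(chron)
chron("06/15/04")                  # chron: times but no time zones
as.POSIXct("2004-06-15 12:00:00",
           tz = "GMT")             # POSIXct: time zone aware
```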

A full-page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or datetime package to represent datetimes.

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

## default origin
options(chron.origin = c(month=1, day=1, year=1970))
## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
## if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  ## days since 01/01/60
## chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James D A and Pregibon D (1993) Chrono-logical objects for data analysis In Proceed-ings of the 25th Symposium of the Interface SanDiego URL httpcmbell-labscomcmmsdepartmentssiadjpaperschronpdf 29

Ripley B D and Hornik K (2001) Date-timeclasses R News 1 (2) 8ndash11 URL httpCRANR-projectorgdocRnews 30

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de

R News ISSN 1609-3631

Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day")  [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2]  [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2]  [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc)); as.Date(format(dp))
  to chron:   chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
  to POSIXct: as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc)); as.Date(dp)
  to chron:   chron(unclass(dt)); as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT"); as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1-specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
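One possible solution to that exercise (my sketch, not part of the original article; the name AUC is my own choice) applies the trapezoidal rule to the sens and mspec components stored in the S3 object above:

```r
## Sketch: trapezoidal-rule area under an S3 "ROC" object,
## assuming the sens and mspec components defined earlier.
AUC <- function(x) {
  dx <- diff(x$mspec)                               # widths along 1-specificity
  ymid <- (x$sens[-1] + x$sens[-length(x$sens)])/2  # mean of adjacent heights
  sum(dx * ymid)
}
```

A useful check on such a function is that AUC(ROC(T, D)) should be about 0.5 when T is unrelated to D and approach 1 as T separates the groups perfectly.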

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
  )

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 190

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
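A minimal illustration of these semantics (my sketch, not part of the NEWS entry):

```r
e <- new.env()
e$x <- 1                # same effect as assign("x", 1, envir = e)
e[["y"]] <- 2
e$x + e[["y"]]          # retrieves 1 and 2; returns 3
get("x", envir = e, inherits = FALSE)  # equivalent to e$x
```

Because no partial matching is done, e$xvalue is NULL here even though e$x exists, unlike $ on lists.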

• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix, as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


ndash Added enginedisplaylist() functionThis allows the user to tell grid NOT touse the graphics engine display list andto handle ALL redraws using its own dis-play list (including redraws after deviceresizes and copies)This provides a way to avoid some of theproblems with resizing a device when youhave used gridconvert() or the grid-Base package or even base functions suchas legend()There is a document discussing the useof display lists in grid on the grid website httpwwwstataucklandacnz~paulgridgridhtml

ndash Changed the implementation of grob ob-jects They are no longer implementedas external references They are nowregular R objects which copy-by-valueThis means that they can be savedloadedlike normal R objects In order to retainsome existing grob behaviour the follow-ing changes were necessary

grobs all now have a name slot Thegrob name is used to uniquely iden-tify a drawn grob (ie a grob on thedisplay list)

gridedit() and gridpack() nowtake a grob name as the firs argumentinstead of a grob (Actually they takea gPath see below)

the grobwidth and grobheightunits take either a grob OR a grobname (actually a gPath see below)Only in the latter case will the unitbe updated if the grob pointed to ismodified

In addition the following features arenow possible with grobs

grobs now save()load() like anynormal R object

many grid() functions now have aGrob() counterpart The grid()version is used for its side-effectof drawing something or modifyingsomething which has been drawn theGrob() version is used for its re-turn value which is a grob Thismakes it more convenient to just workwith grob objects without producingany graphical output (by using theGrob() functions)

there is a gTree object (derived fromgrob) which is a grob that canhave children A gTree also has achildrenvp slot which is a view-port which is pushed and then uped

before the children are drawn this al-lows the children of a gTree to placethemselves somewhere in the view-ports specified in the childrenvp byhaving a vpPath in their vp slot

there is a gPath object which is essen-tially a concatenation of grob namesThis is used to specify the child of (achild of ) a gTree

there is a new API for creat-ingaccessingmodifying grob ob-jects gridadd() gridremove()gridedit() gridget() (and theirGrob() counterparts can be used toadd remove edit or extract a grobor the child of a gTree NOTE thenew gridedit() API is incompati-ble with the previous version

ndash Added stringWidth() stringHeight()grobWidth() and grobHeight() conve-nience functions (they produce strwidthstrheight grobwidth and grobheightunit objects respectively)

ndash Allowed viewports to turn off clipping al-together Possible settings for viewportclip arg are now

on clip to the viewport (was TRUE)inherit clip to whatever parent says

(was FALSE)off turn off clipping

Still accept logical values (and NA mapsto off)

bull R CMD check now runs the (Rd) examples withdefault RNGkind (uniform amp normal) andcodesetseed(1) example( setRNG = TRUE)does the same

bull undoc() in package lsquotoolsrsquo has a new default oflsquousevalues = NULLrsquo which produces a warn-ing whenever the default values of functionarguments differ between documentation andcode Note that this affects R CMD check aswell

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
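As a hypothetical illustration (the package name and field values below are invented), such a DESCRIPTION might read:

```
Package: toyPkg
Version: 0.1
Depends: R (>= 1.9.0)
Suggests: lattice, boot
Description: Example package; lattice and boot are used only in
  the examples and vignettes, so they are listed under Suggests
  rather than Depends.
```

Packages listed in Suggests need not be installed for the package itself to load.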

R News ISSN 1609-3631

Vol. 4/1, June 2004

• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().
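The replacement is called in much the same way; a minimal sketch:

```r
# packageDescription() parses a package's DESCRIPTION file;
# individual fields are then available by name.
desc <- packageDescription("stats")
desc$Version   # version string of the installed stats package
desc$Title
```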

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv = 2) and psigamma(, deriv = 3).
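A minimal sketch of the replacement (the sample points are arbitrary):

```r
# psigamma(x, deriv = n) is the nth derivative of the digamma
# function, so deriv = 2 reproduces tetragamma() and
# deriv = 3 reproduces pentagamma().
x <- c(0.5, 1, 2, 10)
psigamma(x, deriv = 2)
psigamma(x, deriv = 3)
```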

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1 and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object-oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm, and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NNs. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A Simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


• Editorial
• The Decision To Use R
  – Preface
  – Some Background on the Company
  – Evaluating The Tools of the Trade - A Value Based Equation
  – An Introduction To R
  – A Continued Evolution
  – Giving Back
  – Thanks
• The ade4 package - I: One-table methods
  – Introduction
  – The duality diagram class
  – Distance matrices
  – Taking into account groups of individuals
  – Permutation tests
  – Conclusion
• qcc: An R package for quality control charting and statistical process control
  – Introduction
  – Creating a qcc object
  – Shewhart control charts
  – Operating characteristic function
  – Cusum charts
  – EWMA charts
  – Process capability analysis
  – Pareto charts and cause-and-effect diagrams
  – Tuning and extending the package
  – Summary
  – References
• Least Squares Calculations in R
• Tools for interactively exploring R packages
  – References
• The survival package
  – Overview
  – Specifying survival data
  – One and two sample summaries
  – Proportional hazards models
  – Additional features
• useR! 2004
• R Help Desk
  – Introduction
  – Other Applications
  – Time Series
  – Input
  – Avoiding Errors
  – Comparison Table
• Programmers' Niche
  – ROC curves
  – Classes and methods
  – Generalising the class
  – S4 classes and methods
  – Discussion
• Changes in R
  – New features in 1.9.1
  – User-visible changes in 1.9.0
  – New features in 1.9.0
• Changes on CRAN
  – New contributed packages
  – Other changes
• R Foundation News
  – Donations
  – New supporting institutions
  – New supporting members

Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a Surv object by the Surv() function. If the start variable is zero for everyone, it may be omitted.
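As a minimal sketch of this counting-process form (the numbers here are made up, not from the article's data):

```r
library(survival)

# Two records in (start, stop, event] form: subject A is observed
# from time 0 to 5 and has an event; subject B enters late at
# time 2 and is censored at time 8.
start <- c(0, 2)
stop  <- c(5, 8)
event <- c(1, 0)
Surv(start, stop, event)

# When everyone starts at zero, the start argument can be dropped:
Surv(stop, event)
```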

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30 + time,
                     status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a survfit object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


[Figure 1: Survival distributions for two lung cancer treatments (Standard, New): % surviving versus years since randomisation.]

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

  log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
           ratetable(edtrt = edtrt, bili = bili,
                     platelet = platelet, age = age,
                     protime = protime),
         data = pbc, subset = trt == -9,
         ratetable = mayomodel, cohort = TRUE),
       col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

[Figure 2: Observed and predicted survival.]

The main assumption of the proportional hazardsmodel is that hazards for different groups are in factproportional ie that β is constant over time The

R News ISSN 1609-3631

Vol. 4/1, June 2004 28

The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure 3: Testing proportional hazards. Smoothed estimate of Beta(t) for edtrt against time.]

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
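As a sketch of the syntax only (not from the article: it uses the lung data shipped with survival, and the variable choices here are illustrative, not clinically motivated):

```r
library(survival)
# Illustrative sketch: give each sex its own baseline hazard
# via strata(), and request a robust (sandwich) variance with
# observations clustered on the enrolling institution.
fit <- coxph(Surv(time, status) ~ age + strata(sex) +
               cluster(inst), data = lung)
fit
```

The same formula machinery accepts start/stop intervals in Surv() for time-varying covariates.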

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
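For example, a Weibull accelerated-failure-time fit (a sketch using the lung data shipped with survival; not an example from the article):

```r
library(survival)
# Parametric survival model: log survival time is modelled as
# linear in the covariates with a Weibull error distribution.
wfit <- survreg(Surv(time, status) ~ age + sex,
                data = lung, dist = "weibull")
summary(wfit)
```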

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and for computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming;

• Martin Mächler on good programming practice;

• Luke Tierney on namespaces and byte compilation;

• Kurt Hornik on packaging, documentation, and testing;

• Friedrich Leisch on S4 classes and methods;

• Peter Dalgaard on language interfaces, using .Call and .External;

• Douglas Bates on multilevel models in R; and

• Brian D. Ripley on data mining and related topics.

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt, and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of


the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
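For instance, the internal representations described above can be inspected directly (a quick sketch; the values follow from the origins given in the text):

```r
d <- as.Date("1970-01-02")
unclass(d)                        # 1: one day after the origin

x <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")
unclass(x)                        # 129600 seconds since the origin

lt <- as.POSIXlt(x, tz = "GMT")
names(unclass(lt))                # sec, min, hour, mday, mon, ...
```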

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
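A worked instance of the Windows-origin conversion (Excel's serial number 36526 corresponds to 2000-01-01):

```r
x <- c(1, 36526.25)              # Excel (Windows) serial day numbers
as.Date("1899-12-30") + floor(x)
# [1] "1899-12-31" "2000-01-01"
```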

SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and date/times are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent date/times; zoo can use any date or time/date package to represent date/times.
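For instance, a regular monthly series in ts (a minimal sketch; the irregular classes mentioned above are analogous but take explicit date/times for each observation):

```r
# 24 monthly observations starting January 2002; ts stores the
# dates implicitly as start time plus frequency.
x <- ts(rnorm(24), start = c(2002, 1), frequency = 12)
frequency(x)    # 12
time(x)[1:3]    # 2002.000, 2002.083, ... (years plus fractions)
```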

Input

By default, read.table will read in numeric data such as 20040115 as numbers, and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.[1]

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion, there is really no reason to have to resort to non-standard origins. For example:

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data.


orig <- chron("01/01/60")
x <- 0:9    # days since 01/01/60

# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
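Two quick sketches of the checks recommended above (the date-time value is chosen for illustration only):

```r
dp <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")

# (1) compare output with tz="" and tz="GMT" to see whether a
#     function honours the tz= argument:
format(dp, tz = "GMT")    # "2004-06-01 12:00:00"

# (2) convert via character so the wall-clock fields survive,
#     rather than converting between classes directly:
lt <- as.POSIXlt(format(dp, tz = "GMT"), tz = "GMT")
lt$hour                   # 12
```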

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


Table 1: Comparison Table

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))  (GMT days)

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")

to Date
  from chron:   as.Date(dates(dc))
  from POSIXct: as.Date(format(dp))

to chron
  from Date:    chron(unclass(dt))
  from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]

to POSIXct
  from Date:    as.POSIXct(format(dt))
  from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")

to Date
  from chron:   as.Date(dates(dc))
  from POSIXct: as.Date(dp)

to chron
  from Date:    chron(unclass(dt))
  from POSIXct: as.chron(dp)

to POSIXct
  from Date:    as.POSIXct(format(dt), tz="GMT")
  from chron:   as.POSIXct(dc)

[1] Can be reduced to dp1-dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/times suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 − specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
    cutpoints <- c(-Inf, sort(unique(T)), Inf)
    sens <- sapply(cutpoints,
        function(c) sum(D[T > c])/sum(D))
    spec <- sapply(cutpoints,
        function(c) sum((1-D)[T <= c])/sum(1-D))
    plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined:

drawROC <- function(T, D) {
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    rval <- list(sens=sens, mspec=mspec, test=TT,
                 call=sys.call())
    class(rval) <- "ROC"
    rval
}

print.ROC <- function(x, ...) {
    cat("ROC curve: ")
    print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity",
                     ylab="Sensitivity",
                     main=NULL, ...) {
    par(pty="s")
    plot(x$mspec, x$sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
        abline(0, 1, lty=3)
    if(is.null(main))
        main <- x$call
    title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function to compute the area under theROC curve is left as an exercise for the reader
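One possible solution sketch (mine, not code from the column): apply the trapezoidal rule to the stored sensitivity and 1−specificity vectors, adding the (0,0) corner by hand.

```r
# Trapezoidal-rule AUC for the S3 "ROC" objects built above.
AUC <- function(x) {
    sens  <- c(0, x$sens)     # prepend the (0,0) corner
    mspec <- c(0, x$mspec)
    sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}

# Quick check on a perfectly discriminating test:
T <- c(1, 2, 3, 4); D <- c(0, 0, 1, 1)
DD <- table(-T, D)
roc <- list(sens  = cumsum(DD[, 2]) / sum(DD[, 2]),
            mspec = cumsum(DD[, 1]) / sum(DD[, 1]))
AUC(roc)   # 1 for perfect separation
```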

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    rval <- list(sens=sens, mspec=mspec,
                 test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously
                 far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    rval <- list(sens=sens, mspec=mspec,
                 test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
    representation(sens="numeric", mspec="numeric",
                   test="numeric", call="call"),
    validity=function(object) {
        length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)
    })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
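For instance, handing new() vectors of unequal lengths trips the validity check (a sketch; the class definition is repeated so the snippet stands alone):

```r
library(methods)
setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
                   test = "numeric", call = "call"),
    validity = function(object) {
        length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)
    })

ok  <- new("ROC", sens = c(.5, 1), mspec = c(0, .5),
           test = c(2, 1), call = quote(ROC(T, D)))
bad <- try(new("ROC", sens = 1, mspec = c(0, 1),
               test = 1, call = quote(ROC(T, D))),
           silent = TRUE)
inherits(bad, "try-error")   # TRUE: the lengths disagree
```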

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new


function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
    function(object) {
        cat("ROC curve: ")
        print(object@call)
    })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
    function(x, y, type="b", null.line=TRUE,
             xlab="1-Specificity",
             ylab="Sensitivity",
             main=NULL, ...) {
        par(pty="s")
        plot(x@mspec, x@sens, type=type,
             xlab=xlab, ylab=ylab, ...)
        if(null.line)
            abline(0, 1, lty=3)
        if(is.null(main))
            main <- x@call
        title(main=main)
    })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
    function(x, ...) {
        lines(x@mspec, x@sens, ...)
    })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call:

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T, ...) {
        if (!(T$family$family %in%
              c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")
        test <- fitted(T)
        disease <- (test + resid(T, type="response"))
        disease <- disease*weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far
                     from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[,2])/sum(DD[,2])
        mspec <- cumsum(DD[,1])/sum(DD[,1])
        new("ROC", sens=sens, mspec=mspec,
            test=TT, call=sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
    function(x, labels=NULL, ...,
             digits=1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
                 labels=labels, ...)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R
by the R Core Team

New features in 1.9.1

bull asDate() now has a method for POSIXlt ob-jects

bull mean() has a method for difftime objects andso summary() works for such objects

bull legend() has a new argument ptcex

bull plotts() has more arguments particularlyyaxflip

bull heatmap() has a new keepdendro argument

bull The default barplot method now handles vec-tors and 1-d arrays (eg obtained by table())the same and uses grey instead of heat colorpalettes in these cases (Also fixes PR6776)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
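A brief sketch of the new behaviour (variable name chosen for illustration):

```r
# Underscore is now a regular name character (R >= 1.9.0)
my_total <- 1 + 2

# make.names() no longer replaces underscores
make.names("col_1")   # "col_1", underscore preserved
```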

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has

R News ISSN 1609-3631


been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
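For instance, a short sketch of the new environment accessors:

```r
e <- new.env()
e$x <- 1            # like assign("x", 1, envir = e)
e[["y"]] <- 2

e$x                 # 1
e[["y"]]            # 2
e$xyz               # NULL: no partial matching, no inheritance
```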

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc., now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The 'package' argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
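A quick sketch of the new class in use:

```r
d <- as.Date("2004-06-01")

d + 30                               # arithmetic in whole days: "2004-07-01"
format(d, "%Y/%m/%d")                # formatting as for date-times
seq(d, by = "month", length.out = 3) # sequences of dates
```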

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument 'package' which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() (defined as gamma(x+1)) and, for S-PLUS compatibility, lfactorial() (defined as lgamma(x+1)).
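The definitions above can be verified directly; the log version avoids overflow for large arguments:

```r
factorial(5)                           # 120, i.e. gamma(6)

# lfactorial() works where factorial() would overflow to Inf
lfactorial(200)                        # log(200!), finite
factorial(200)                         # Inf in double precision
```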

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments 'etastart' and 'mustart' are now evaluated via the model frame in the same way as 'subset' and 'weights'.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
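For example:

```r
head(letters, 3)    # "a" "b" "c"
tail(1:10, 2)       # 9 10
head(mtcars, 2)     # also works row-wise on data frames
```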

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
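A one-line sketch:

```r
x <- 1
y <- 2

# Retrieve several bindings at once, as a named list
mget(c("x", "y"), envir = globalenv())   # list(x = 1, y = 2)
```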

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a 'model' argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both 'x' and 'covmat' are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both 'each' and 'length.out' have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both 'times' and 'length.out' have been specified, 'times' is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
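The S-compatibility behaviours of rep() described above can be sketched as:

```r
rep(integer(0), length.out = 3)      # length 3 (all NA), not length 0
rep(1:2, each = 2, length.out = 5)   # recycles: 1 1 2 2 1
rep(1:3, times = 2, length.out = 4)  # 'times' ignored: 1 2 3 1
```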

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
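A short sketch (the file name is made up for illustration):

```r
fname <- "my file.txt"

# Quote before pasting into a shell command line, so the embedded
# space survives word-splitting in Bourne-family shells
cmd <- paste("ls -l", shQuote(fname))

shQuote("abc")    # "'abc'"
shQuote("don't")  # embedded single quote is escaped correctly
```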

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments, and split="" is optimized.
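For example, fixed=TRUE treats the split string literally (useful when it contains regexp metacharacters such as "."), and perl=TRUE enables Perl-style regular expressions:

```r
strsplit("192.168.0.1", ".", fixed = TRUE)[[1]]
# "192" "168" "0" "1"

strsplit("a1b22c333d", "[0-9]+", perl = TRUE)[[1]]
# "a" "b" "c" "d"
```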

• subset() now allows a 'drop' argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
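A short sketch:

```r
env <- new.env()
env$a <- 1
env$b <- "two"

vals <- as.list(env)   # named list; element order is not guaranteed
vals[["a"]]            # 1
```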

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in

the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added 'just' argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed

before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport 'clip' arg are now:

  "on"      clip to the viewport (was TRUE)
  "inherit" clip to whatever parent says (was FALSE)
  "off"     turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination


follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check


given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates are performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward way to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers

rrcov Functions for Robust Location and Scat-ter Estimation and Robust Regression withHigh Breakdown Point (covMcd() ltsReg())Originally written for S-PLUS (fastlts andfastmcd) by Peter Rousseeuw amp Katrien vanDriessen ported to R adapted and packagedby Valentin Todorov

sandwich Model-robust standard error estimatorsfor time series and longitudinal data ByThomas Lumley Achim Zeileis

seqmon A program that computes the probabilityof crossing sequential boundaries in a clinicaltrial It implements the Armitage-McPhersonand Rowe Algorithm using the method de-scribed in Schoenfeld D (2001) ldquo A simple Al-gorithm for Designing Group Sequential Clini-cal Trialsrdquo Biometrics 27 972ndash974 By David ASchoenfeld

setRNG Set reproducible random number generatorin R and S By Paul Gilbert

sfsmisc Useful utilities [lsquogoodiesrsquo] from Seminarfuer Statistik ETH Zurich many ported fromS-plus times By Martin Maechler and manyothers

skewt Density distribution function quantile func-tion and random generation for the skewed tdistribution of Fernandez and Steel By RobertKingbdquo with contributions from Emily Ander-son

spatialCovariance Functions that compute the spa-tial covariance matrix for the matern andpower classes of spatial models for data thatarise on rectangular units This code can alsobe used for the change of support problem andfor spatial data that arise on irregularly shapedregions like counties or zipcodes by laying afine grid of rectangles and aggregating the in-tegrals in a form of Riemann integration ByDavid Clifford

spc Evaluation of control charts by means of thezero-state steady-state ARL (Average RunLength) Setting up control charts for given in-control ARL and plotting of the related figuresThe control charts under consideration are one-and two-sided EWMA and CUSUM charts formonitoring the mean of normally distributedindependent data Other charts and parame-ters are in preparation Further SPC areas willbe covered as well (sampling plans capabilityindices ) By Sven Knoth

spe Implements stochastic proximity embedding asdescribed by Agrafiotis et al in PNAS 200299 pg 15869 and J Comput Chem 200324 pg1215 By Rajarshi Guha

supclust Methodology for Supervised Grouping ofPredictor Variables By Marcel Dettling andMartin Maechler

svmpath Computes the entire regularization pathfor the two-class SVM classifier with essentialythe same cost as a single SVM fit By TrevorHastie

treeglia Stem analysis functions for volume incre-ment and carbon uptake assessment from tree-rings By Marco Bascietto

urca Unit root and cointegration tests encounteredin applied econometric analysis are imple-mented By Bernhard Pfaff

urn Functions for sampling without replacement(Simulated Urns) By Micah Altman

verify This small package contains simple tools forconstructing and manipulating objects of classverify These are used when regression-testingR software By S Original by Patrick J BurnsPorted to R by Nick Efthymiou

zoo A class with methods for totally ordered in-dexed observations such as irregular time se-ries By Achim Zeileis Gabor Grothendieck

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions go to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.


The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics, the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

> library(survival)
> data(veteran)  # time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...

# time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time,
                     status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the G-rho family of weighted logrank tests). In this example it is clear from the graphs and from the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving",
       yscale = 100,
       col = c("forestgreen", "blue"))

> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928


Figure 1: Survival distributions for two lung cancer treatments (Standard and New), plotted as % surviving against years since randomisation.

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + beta'z,

that is, h(t; z) = h0(t) exp(beta'z).

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                       log(bili) + log(protime) +
                       age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age +
        platelet, data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
                  ratetable(edtrt = edtrt, bili = bili,
                            platelet = platelet,
                            age = age, protime = protime),
                data = pbc, subset = trt == -9,
                ratetable = mayomodel, cohort = TRUE),
        col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that beta is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt against time).

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of beta at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
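As an illustrative sketch (not an example from the text; it uses the recurrent bladder tumour data shipped with the survival package), the two terms might be combined as follows:

```r
## cluster(id) groups the multiple records for each patient,
## giving robust (sandwich) standard errors; strata(enum)
## lets each event number have its own baseline hazard.
library(survival)
data(bladder)
fit <- coxph(Surv(stop, event) ~ rx + size + number +
               cluster(id) + strata(enum),
             data = bladder)
fit
```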

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
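A hedged sketch of survreg with the veteran data used earlier (the choice of covariates and distribution here is illustrative, not taken from the text):

```r
## An accelerated-failure (parametric) model; dist = "weibull"
## is the default, with "lognormal", "exponential" and others
## also available.
library(survival)
data(veteran)
wfit <- survreg(Surv(time, status) ~ trt + karno,
                data = veteran, dist = "weibull")
summary(wfit)
```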

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
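A minimal sketch of these penalised terms, assuming the frailty() and pspline() formula terms of the survival package (the data sets ship with the package; the particular models are illustrative):

```r
## frailty() adds a random effect per group (here, litter);
## pspline() adds a smoothing-spline term fitted by
## penalised likelihood.
library(survival)
data(rats)
ffit <- coxph(Surv(time, status) ~ rx + frailty(litter),
              data = rats)
data(pbc)
sfit <- coxph(Surv(time, status) ~ pspline(age) + edtrt,
              data = pbc, subset = trt > 0)
```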

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and for computing person-years of follow-up.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox
McMaster University, Canada
[email protected]

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
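The internal representations just described can be inspected directly. A small sketch (assuming the chron package is installed):

```r
## Date counts days, chron days plus fractions of a day, and
## POSIXct seconds, all from the 1970-01-01 origin.
library(chron)
dt <- as.Date("1970-01-02")
dc <- chron("01/02/70", "12:00:00")
dp <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")
unclass(dt)  # 1 (days)
unclass(dc)  # 1.5 (days, with attributes)
unclass(dp)  # 129600 (seconds)
```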

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible; otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article that illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It is possible to set Excel to use either origin, which is why the word "usually" was employed above.
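A short sketch of the Windows-origin conversion described above (the serial numbers are invented for illustration):

```r
## Hypothetical Excel (Windows) serial date-times.
library(chron)
x <- c(2, 36526.5)
as.Date("1899-12-30") + floor(x)  # Date class dates
chron("12/30/1899") + x           # chron date/times
```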

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or datetime package to represent datetimes.
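As a brief sketch of an irregular series in zoo, indexed here by the Date class (the values are invented):

```r
## An irregularly spaced daily series: observations carry
## their own Date index rather than a fixed frequency.
library(zoo)
d <- as.Date(c("2004-01-05", "2004-01-07", "2004-01-12"))
z <- zoo(c(1.2, 1.5, 1.1), d)
z
```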

Input

By default, read.table will read in numeric data such as 20040115 as numbers, and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.[1]

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to resort to non-standard origins. For example,

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, the day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
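A sketch of the convert-to-character-first advice, mirroring the recipes in the accompanying table (assuming the chron package is installed):

```r
## Formatting in the intended zone first makes the conversion
## explicit, rather than relying on a direct class-to-class
## conversion that may apply an unexpected time zone.
library(chron)
dp <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")
as.Date(format(dp, tz = "GMT"))
chron(format(dp, "%m/%d/%Y", tz = "GMT"),
      format(dp, "%H:%M:%S", tz = "GMT"))
```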

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


In each entry below, the three expressions correspond to the Date, chron and POSIXct classes, in that order.

now:
  Sys.Date()
  as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  Sys.time()

origin:
  structure(0, class="Date")
  chron(0)
  structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT):
  structure(floor(x), class="Date")
  chron(x)
  structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days:
  dt1-dt2
  dc1-dc2
  difftime(dp1, dp2, unit="day") [1]

time zone difference (POSIXct only):
  dp - as.POSIXct(format(dp), tz="GMT")

compare:
  dt1 > dt2
  dc1 > dc2
  dp1 > dp2

next day:
  dt+1
  dc+1
  seq(dp0, length=2, by="DSTday")[2] [2]

previous day:
  dt-1
  dc-1
  seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date:
  dt0+floor(x)
  dc0+x
  seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence:
  dts <- seq(dt0, length=10, by="day")
  dcs <- seq(dc0, length=10)
  dps <- seq(dp0, length=10, by="DSTday")

plot:
  plot(dts, rnorm(10))
  plot(dcs, rnorm(10))
  plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week:
  seq(dt0, length=3, by="2 week")
  dc0+seq(0, length=3, by=14)
  seq(dp0, length=3, by="2 week")

first day of month:
  as.Date(format(dt, "%Y-%m-01"))
  chron(dc) - month.day.year(dc)$day + 1
  as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts:
  as.numeric(format(dt, "%m"))
  months(dc)
  as.numeric(format(dp, "%m"))

day of week (Sun=0):
  as.numeric(format(dt, "%w"))
  as.numeric(dates(dc)-3) %% 7
  as.numeric(format(dp, "%w"))

day of year:
  as.numeric(format(dt, "%j"))
  as.numeric(format(as.Date(dc), "%j"))
  as.numeric(format(dp, "%j"))

mean for each month/year:
  tapply(x, format(dt, "%Y-%m"), mean)
  tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means:
  tapply(x, format(dt, "%m"), mean)
  tapply(x, months(dc), mean)
  tapply(x, format(dp, "%m"), mean)

Output ("yyyy-mm-dd"):
  format(dt)
  format(as.Date(dates(dc)))
  format(dp, "%Y-%m-%d")

Output ("mm/dd/yy"):
  format(dt, "%m/%d/%y")
  format(dates(dc))
  format(dp, "%m/%d/%y")

Output ("Sat Jul 1, 1970"):
  format(dt, "%a %b %d, %Y")
  format(as.Date(dates(dc)), "%a %b %d, %Y")
  format(dp, "%a %b %d, %Y")

Input ("1970-10-15"):
  as.Date(z)
  chron(z, format="y-m-d")
  as.POSIXct(z)

Input ("10/15/1970"):
  as.Date(z, "%m/%d/%Y")
  chron(z)
  as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz=""):
  to Date:
    as.Date(dates(dc))                                       (from chron)
    as.Date(format(dp))                                      (from POSIXct)
  to chron:
    chron(unclass(dt))                                       (from Date)
    chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))    (from POSIXct) [6]
  to POSIXct:
    as.POSIXct(format(dt))                                   (from Date)
    as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))    (from chron)

Conversion (tz="GMT"):
  to Date:
    as.Date(dates(dc))                                       (from chron)
    as.Date(dp)                                              (from POSIXct)
  to chron:
    chron(unclass(dt))                                       (from Date)
    as.chron(dp)                                             (from POSIXct)
  to POSIXct:
    as.POSIXct(format(dt), tz="GMT")                         (from Date)
    as.POSIXct(dc)                                           (from chron)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison table.
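As a quick check of the first rows of Table 1, the three "now" and "origin" idioms can be run side by side. This is a sketch assuming the chron package is installed; the dt/dc/dp names follow the table's conventions.

```r
library(chron)  # assumed installed

## "now" in each of the three classes
dt <- Sys.Date()
dc <- as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
dp <- Sys.time()

## "origin" in each of the three classes; all three print as 1970-01-01
structure(0, class = "Date")
chron(0)
structure(0, class = c("POSIXt", "POSIXct"))
```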


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 - specificity), Pr(T > c|D = 0) for all values of c. Because the probabilities are conditional on disease status the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that -Inf, +Inf, and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c])/sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
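A minimal sketch of calling this function on simulated data (the variable names and the simulated shift are illustrative, not from the article):

```r
## diseased subjects (D = 1) have test scores shifted upward by 1
set.seed(42)
D <- rep(0:1, each = 50)
T <- rnorm(100, mean = D)
drawROC(T, D)   # draws an ROC curve well above the diagonal
```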

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
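A short interactive sketch of the dispatch, assuming the ROC() constructor above and illustrative test and disease vectors:

```r
set.seed(1)
D <- rep(0:1, each = 50)
T <- rnorm(100, mean = D)

r <- ROC(T, D)
r                    # print() dispatches to print.ROC
inherits(r, "ROC")   # TRUE
```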

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
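One possible sketch of that exercise, applying the trapezoidal rule to the (1-specificity, sensitivity) pairs stored in the S3 object (AUC is a hypothetical helper, not part of the article's code):

```r
AUC <- function(x) {
  ## prepend (0, 0) so the curve starts at the origin;
  ## the cumulative sums already end at (1, 1)
  mspec <- c(0, x$mspec)
  sens <- c(0, x$sens)
  ## trapezoidal rule: sum of strip widths times average heights
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```

For a test with no discriminating ability the result should be near 0.5, and near 1 for a nearly perfect test.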

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
         representation(sens = "numeric", mspec = "numeric",
                        test = "numeric", call = "call"),
         validity = function(object) {
           length(object@sens) == length(object@mspec) &&
             length(object@sens) == length(object@test)
         })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
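For instance, a deliberately malformed object is rejected by the validity check (a sketch; the exact error message may differ between R versions):

```r
## mspec has length 1 while sens and test have length 2,
## so new() signals an invalid-class error, caught here by try()
try(new("ROC", sens = c(0, 1), mspec = 0,
        test = c(0, 1), call = quote(ROC())))
```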

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new


function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
          function(object) {
            cat("ROC curve: ")
            print(object@call)
          })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
          function(x, y, type = "b", null.line = TRUE,
                   xlab = "1-Specificity",
                   ylab = "Sensitivity",
                   main = NULL, ...) {
            par(pty = "s")
            plot(x@mspec, x@sens, type = type,
                 xlab = xlab, ylab = ylab, ...)
            if (null.line)
              abline(0, 1, lty = 3)
            if (is.null(main))
              main <- x@call
            title(main = main)
          })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
          function(x, ...) {
            lines(x@mspec, x@sens, ...)
          })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
          function(T) {
            if (!(T$family$family %in%
                  c("binomial", "quasibinomial")))
              stop("ROC curves for binomial glms only")
            test <- fitted(T)
            disease <- (test + resid(T, type = "response"))
            disease <- disease * weights(T)
            if (max(abs(disease %% 1)) > 0.01)
              warning("Y values suspiciously far from integers")
            TT <- rev(sort(unique(test)))
            DD <- table(-test, disease)
            sens <- cumsum(DD[, 2])/sum(DD[, 2])
            mspec <- cumsum(DD[, 1])/sum(DD[, 1])
            new("ROC", sens = sens, mspec = mspec,
                test = TT, call = sys.call())
          })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
          function(x, labels = NULL, ..., digits = 1) {
            if (is.null(labels))
              labels <- round(x@test, digits)
            identify(x@mspec, x@sens,
                     labels = labels, ...)
          })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has


been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[ and [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
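A minimal sketch of the new environment access described above:

```r
e <- new.env()
e$x <- 1    # equivalent to assign("x", 1, envir = e)
e[["x"]]    # equivalent to get("x", envir = e, inherits = FALSE)
```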

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments, for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos() and atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
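For example:

```r
factorial(5)      # 120, computed as gamma(6)
lfactorial(100)   # log(100!), without overflowing
```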

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame, in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  – Added a just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list, and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit() and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

    "on": clip to the viewport (was TRUE)
    "inherit": clip to whatever the parent says (was FALSE)
    "off": turn off clipping

    Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


bull Added an option to packagedependencies()to handle the rsquoSuggestsrsquo levels of dependencies

bull Vignette dependencies can now be checkedand obtained via vignetteDepends

bull Option repositories to list URLs for pack-age repositories added

bull packagedescription() has been replaced bypackageDescription()

bull R CMD INSTALLbuild now skip Subversionrsquossvn directories as well as CVS directories

bull arraySubscript and vectorSubscript take anew argument which is a function pointer thatprovides access to character strings (such as thenames vector) rather than assuming these arepassed in

bull R_CheckUserInterrupt is now described inlsquoWriting R Extensionsrsquo and there is a newequivalent subroutine rchkusr for calling fromFORTRAN code

bull hsv2rgb and rgb2hsv are newly in the C API

bull Salloc and Srealloc are provided in Sh aswrappers for S_alloc and S_realloc sincecurrent S versions use these forms

bull The type used for vector lengths is nowR_len_t rather than int to allow for a futurechange

bull The internal header nmathdpqh hasslightly improved macros R_DT_val() andR_DT_Cval() a new R_D_LExp() and improvedR_DT_log() and R_DT_Clog() this improvesaccuracy in several [dpq]-functions for extremearguments

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv = 2) and psigamma(, deriv = 3).
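For readers updating code, a small illustrative snippet (the identities for deriv = 0 and 1 are standard; deriv = 2 and 3 are the replacements for the deprecated functions):

```r
## psigamma(x, deriv = n) is the nth derivative of the digamma
## function; deriv = 0 and deriv = 1 reproduce digamma() and
## trigamma(), while deriv = 2 and deriv = 3 take over from the
## deprecated tetragamma() and pentagamma().
x <- c(0.5, 1, 2.5)
stopifnot(all.equal(psigamma(x, deriv = 0), digamma(x)))
stopifnot(all.equal(psigamma(x, deriv = 1), trigamma(x)))
```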

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1 and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides a straightforward way to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics, 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: [email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).

Overview

Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).

Specifying survival data

In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone, it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.

    > library(survival)
    > data(veteran) ## time from randomisation to death
    > with(veteran, Surv(time, status))
     [1]  72  411  228  126  118   10   82
     [8] 110  314  100+  42    8  144   25+

    ## time from diagnosis to death
    > with(veteran, Surv(diagtime*30,
                         diagtime*30+time, status))
     [1] ( 210, 282 ] ( 150, 561 ]
     [3] (  90, 318 ] ( 270, 396 ]
     [9] ( 540, 854 ] ( 180, 280+]

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.

One and two sample summaries

The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.

    > data(veteran)
    > plot(survfit(Surv(time, status) ~ trt, data = veteran),
           xlab = "Years since randomisation",
           xscale = 365, ylab = "% surviving", yscale = 100,
           col = c("forestgreen", "blue"))
    > survdiff(Surv(time, status) ~ trt, data = veteran)

    Call:
    survdiff(formula = Surv(time, status) ~ trt,
             data = veteran)

           N Observed Expected (O-E)^2/E
    trt=1 69       64     64.5   0.00388
    trt=2 68       64     63.5   0.00394

          (O-E)^2/V
    trt=1   0.00823
    trt=2   0.00823

     Chisq= 0  on 1 degrees of freedom, p= 0.928

R News ISSN 1609-3631

Vol 41 June 2004 27

[Figure 1: Survival distributions for two lung cancer treatments ("Standard" and "New"), plotted as % surviving against years since randomisation.]

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    h(t; z) = h0(t) exp(βz),  i.e.,  log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

    > data(pbc)
    > mayomodel <- coxph(Surv(time, status) ~ edtrt +
                         log(bili) + log(protime) +
                         age + platelet,
                         data = pbc, subset = trt > 0)
    > mayomodel
    Call:
    coxph(formula = Surv(time, status) ~ edtrt +
          log(bili) + log(protime) + age + platelet,
          data = pbc, subset = trt > 0)

                     coef exp(coef)
    edtrt         1.02980     2.800
    log(bili)     0.95100     2.588
    log(protime)  2.88544    17.911
    age           0.03544     1.036
    platelet     -0.00128     0.999

                 se(coef)     z       p
    edtrt        0.300321  3.43 0.00061
    log(bili)    0.097771  9.73 0.00000
    log(protime) 1.031908  2.80 0.00520
    age          0.008489  4.18 0.00003
    platelet     0.000927 -1.38 0.17000

    Likelihood ratio test=185  on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

    > plot(survfit(Surv(time, status) ~ edtrt,
                   data = pbc, subset = trt == -9))
    > lines(survexp(~ edtrt +
                    ratetable(edtrt = edtrt, bili = bili,
                              platelet = platelet, age = age,
                              protime = protime),
                    data = pbc, subset = trt == -9,
                    ratetable = mayomodel, cohort = TRUE),
            col = "purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

[Figure 2: Observed and predicted survival, plotted as survival probability (0.0–1.0) against follow-up time (0–4000 days).]

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

    > cox.zph(mayomodel)
                     rho chisq      p
    edtrt        -0.1602 3.411 0.0648
    log(bili)     0.1507 2.696 0.1006
    log(protime) -0.1646 2.710 0.0997
    age          -0.0708 0.542 0.4617
    platelet     -0.0435 0.243 0.6221
    GLOBAL            NA 9.850 0.0796

    ## graph for variable 1 (edtrt)
    > plot(cox.zph(mayomodel)[1])

[Figure 3: Testing proportional hazards; the smoothed estimate of Beta(t) for edtrt, plotted against time.]

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
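As an illustration of these special terms (this model is not from the article; it is a sketch using the lung dataset shipped with the survival package), a fit might look like:

```r
library(survival)

## Hypothetical example: a separate baseline hazard for each sex
## via strata(), and a robust (sandwich) variance with patients
## grouped by enrolling institution via cluster().
fit <- coxph(Surv(time, status) ~ age + strata(sex) + cluster(inst),
             data = lung)
fit
```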

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
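A sketch of the parametric counterpart, again using the lung data rather than an example from the article:

```r
library(survival)

## A Weibull accelerated-failure-time model for log survival time;
## other error distributions (e.g. "lognormal", "exponential") are
## selected through the dist argument.
wfit <- survreg(Surv(time, status) ~ age, data = lung,
                dist = "weibull")
summary(wfit)
```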

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004: The R User Conference

by John Fox, McMaster University, Canada ([email protected])

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills, and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk: Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R 1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R 1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
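The internal representations just described are easy to verify with unclass(); for example (Date and POSIXct only, so that no contributed package is needed):

```r
d <- as.Date("1970-01-02")
unclass(d)    # 1: one day after the Date origin, January 1, 1970

p <- as.POSIXct("1970-01-02", tz = "GMT")
unclass(p)    # 86400: seconds since January 1, 1970, GMT
```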

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
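For example (with a made-up serial number; 36526 is 2000-01-01 in the Windows Excel scheme):

```r
## Excel-for-Windows serial dates: days since 1899-12-30
x <- c(36526, 36526.5)
as.Date("1899-12-30") + floor(x)   # both give "2000-01-01"
```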

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
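A minimal ts example (mine, not from the article): a monthly series stores only a start period and a frequency, not explicit dates:

```r
## two years of monthly data starting January 2003
x <- ts(1:24, start = c(2003, 1), frequency = 12)
frequency(x)                   # 12
window(x, start = c(2004, 1))  # the 2004 portion
```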

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates, relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60

# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs; but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
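A minimal illustration of the convert-to-character-first advice (my example): format() renders the datetime in its own time zone, and the target class then parses that text without any further shifting:

```r
dp <- as.POSIXct("2004-06-01 23:00:00", tz = "GMT")

## going via character preserves the printed date
as.Date(format(dp))   # "2004-06-01"
```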

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison Table. Each entry gives the idiom for the Date, chron, and POSIXct classes in turn.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))  # GMT days

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day")  [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2]  [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2]  [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    from chron: as.Date(dates(dc));  from POSIXct: as.Date(format(dp))
  to chron:   from Date: chron(unclass(dt));  from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
  to POSIXct: from Date: as.POSIXct(format(dt));  from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    from chron: as.Date(dates(dc));  from POSIXct: as.Date(dp)
  to chron:   from Date: chron(unclass(dt));  from POSIXct: as.chron(dp)
  to POSIXct: from Date: as.POSIXct(format(dt), tz="GMT");  from chron: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
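To see the monotone curve on simulated data (my check, not from the article; the function is repeated, with the coordinates returned invisibly as an addition, so the block is self-contained):

```r
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
  invisible(list(sens = sens, mspec = mspec))  # returned for checking
}

set.seed(42)
T <- c(rnorm(100, mean = 1), rnorm(100))  # higher scores in the diseased
D <- rep(c(1, 0), each = 100)
r <- drawROC(T, D)
```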

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens)==length(object@mspec) &&
    length(object@sens)==length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object:

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
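A small illustration (mine, not part of the NEWS entry):

```r
e <- new.env()
e$x <- 1        # like assign("x", 1, envir = e)
e$x             # like get("x", envir = e, inherits = FALSE)
e[["x"]] <- 2
e[["x"]]        # 2
```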

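A quick sketch of the new environment indexing (the object names and values here are invented for illustration):

```r
# $ and [[ now work on environments: character names only,
# no partial matching, no inheritance.
e <- new.env()
e$x <- 1:3            # same as assign("x", 1:3, envir = e)
e[["y"]] <- "hello"   # same as assign("y", "hello", envir = e)
e[["x"]]              # same as get("x", envir = e, inherits = FALSE)
```
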
• There are now print() and [ methods for acf objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments, for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

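For example, using the built-in volcano dataset:

```r
# Compute contour lines without drawing anything.
cl <- contourLines(volcano)
# Each element holds one contour: a level plus x and y coordinates.
names(cl[[1]])   # "level" "x" "y"
```
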
• D(), deriv(), etc. now also differentiate asin(), acos() and atan() (thanks to a contribution of Kasper Kristensen).

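For instance:

```r
# Symbolic derivatives of the inverse trigonometric functions.
D(expression(asin(x)), "x")   # 1/sqrt(1 - x^2)
D(expression(atan(x)), "x")   # 1/(1 + x^2)
```
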
• The package argument to data() is no longer allowed to be an (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class 'Date' to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

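A small illustration (the date is arbitrary):

```r
d <- as.Date("2004-06-01")
d + 30                  # arithmetic is in whole days: "2004-07-01"
format(d, "%Y-%m")      # the usual format() machinery applies: "2004-06"
```
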
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• Deprecated() has a new argument, package, which is used in the warning message for non-base packages.

• The print() method for difftime objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix, as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

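For example, with p = 1 and p = 2 the Minkowski distance reduces to the Manhattan and Euclidean distances (the points are invented):

```r
x <- rbind(c(0, 0), c(3, 4))
dist(x, method = "minkowski", p = 1)   # 7 (Manhattan)
dist(x, method = "minkowski", p = 2)   # 5 (Euclidean)
```
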
• expand.grid() returns appropriate array dimensions and dimnames in the attribute 'out.attrs', and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might previously have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

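For example:

```r
factorial(5)      # 120, computed as gamma(5 + 1)
lfactorial(100)   # log(100!), without forming the huge factorial first
```
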
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a 'terms' component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame, in the same way as subset and weights.

• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

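For example, numeric arguments are now coerced rather than rejected:

```r
grep("2", c(12, 30, 42))    # 1 3: the vector is coerced to character
sub("4", "four", 42)        # "four2"
```
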
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

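For example:

```r
head(letters, 3)   # "a" "b" "c"
tail(letters, 2)   # "y" "z"
```
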
• legend() has a new argument 'text.col'.

• methods(class = ) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

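A minimal sketch (names and values invented):

```r
e2 <- new.env()
e2$a <- 1
e2$b <- 2
mget(c("a", "b"), envir = e2)   # named list: list(a = 1, b = 2)
```
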
• model.frame() methods, for example those for lm and glm, pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as lqs and ppr.

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• POSIXct objects can now have a 'tzone' attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both 'x' and 'covmat' are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv = 0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out = n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The POSIXct and POSIXlt methods for rep() now pass ... on to the default method (as expected by PR#5818).

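The new S-compatible behaviour, in brief:

```r
rep(numeric(0), length.out = 3)      # NA NA NA: length 3, no longer length 0
rep(1:2, each = 2, length.out = 5)   # 1 1 2 2 1: recycled, not NA-padded
```
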
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option 'save.defaults', which is also used by save.image() if option 'save.image.defaults' is not present.

• New function shQuote() to quote strings to be passed to OS shells.

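For example (the file name is invented):

```r
shQuote("hello world")   # "'hello world'" in the default Bourne-shell style
cmd <- paste("ls -l", shQuote("my file.txt"))   # safe to hand to system()
```
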
• sink() now has a split = argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments, and split = "" is optimized.

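For example:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c": literal, not regexp
strsplit("abc", "")[[1]]                    # "a" "b" "c": the optimized path
```
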
• subset() now allows a 'drop' argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class 'loadings' to their loadings component.

• Model fits now add a 'dataClasses' attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes lm, mlm, glm and ppr, as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option 'X11fonts'.

• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment() to coerce environments to lists (efficiently).

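A minimal sketch (contents invented; note that the order of names in the result is not guaranteed):

```r
e3 <- new.env()
e3$a <- 1
e3$b <- "two"
l <- as.list(e3)   # dispatches to as.list.environment
```
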
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

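For example, adding row and column totals to a two-way table (using the built-in mtcars data):

```r
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
addmargins(tab)   # appends a Sum row and a Sum column (sum is the default)
```
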
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added a 'just' argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w = .5, h = .5, name = "A"))
    grid.rect()
    pushViewport(viewport(w = .5, h = .5, name = "B"))
    grid.rect(gp = gpar(col = "grey"))
    upViewport(2)
    grid.rect(vp = "A", gp = gpar(fill = "red"))
    grid.rect(vp = vpPath("A", "B"), gp = gpar(fill = "blue"))

– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list, and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a 'name' slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a 'childrenvp' slot, which is a viewport that is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce 'strwidth', 'strheight', 'grobwidth' and 'grobheight' unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport 'clip' argument are now:

"on": clip to the viewport (was TRUE)
"inherit": clip to whatever parent says (was FALSE)
"off": turn off clipping

Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.

• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument, which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv = 2) and psigamma(*, deriv = 3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard unit testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc R functions for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.

kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization, developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like in generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and a backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides a straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.

rlecuyer Provides an interface to the C implemen-tation of the random number generator withmultiple independent streams developed byLrsquoEcuyer et al (2002) The main purpose ofthis package is to enable the use of this ran-dom number generator in parallel R applica-tions By Hana Sevcikova Tony Rossini

rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg 15869, and J. Comput. Chem., 2003, 24, pg 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. By S Original by Patrick J Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

R News ISSN 1609-3631


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K Hackenberger (Croatia)
O Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G Schimek (Austria)
R Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/




Figure 1: Survival distributions for two lung cancer treatments. (Curves labelled "Standard" and "New"; x-axis: years since randomisation; y-axis: per cent surviving.)

Proportional hazards models

The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + βz,

or equivalently h(t; z) = h0(t) exp(βz).

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.

> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)  se(coef)     z       p
edtrt         1.02980     2.800  0.300321  3.43 0.00061
log(bili)     0.95100     2.588  0.097771  9.73 0.00000
log(protime)  2.88544    17.911  1.031908  2.80 0.00520
age           0.03544     1.036  0.008489  4.18 0.00003
platelet     -0.00128     0.999  0.000927 -1.38 0.17000

Likelihood ratio test=185 on 5 df, p=0  n= 312

The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.

> plot(survfit(Surv(time, status) ~ edtrt,
      data=pbc, subset=trt==-9))
> lines(survexp(~ edtrt +
      ratetable(edtrt=edtrt, bili=bili,
        platelet=platelet, age=age,
        protime=protime),
      data=pbc, subset=trt==-9,
      ratetable=mayomodel, cohort=TRUE),
    col="purple")

The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.

Figure 2: Observed and predicted survival.

The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> # graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

Figure 3: Testing proportional hazards. (x-axis: time; y-axis: smoothed estimate of β(t) for edtrt.)

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
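As a concrete sketch of such a call, assuming the survival package is available (the toy data frame and variable names below are invented for illustration, not taken from the article):

```r
library(survival)

# toy multi-record data: two (start, stop] intervals per subject,
# with subjects nested in two centres that get separate baselines
d <- data.frame(
  id       = rep(1:4, each = 2),
  start    = rep(c(0, 5), 4),
  stop     = rep(c(5, 10), 4),
  event    = c(1, 0, 0, 1, 1, 0, 1, 1),
  exposure = c(0.3, -1.2, 0.8, 0.1, -0.5, 1.4, 0.9, -0.2),
  centre   = rep(c("A", "B"), each = 4)
)

# cluster(id): robust variance acknowledging repeated records per subject
# strata(centre): a separate unspecified baseline hazard for each centre
fit <- coxph(Surv(start, stop, event) ~ exposure +
             cluster(id) + strata(centre), data = d)
```

The counting-process form Surv(start, stop, event) is what lets a time-varying covariate be represented as several records per individual.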

Complementing coxph is survreg, which fits linear models for the mean of survival time, or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004 - The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and from finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of date/time (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated, but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It is possible to set Excel to use either origin, which is why the word "usually" was employed above.
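A quick sketch of the Windows-origin conversion (the serial numbers in x are invented for illustration):

```r
# hypothetical Excel-on-Windows serial dates: a date and the same date at noon
x <- c(367.0, 367.5)
dates <- as.Date("1899-12-30") + floor(x)
dates   # both fall on 1901-01-01; the half day (noon) is dropped by floor()
```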

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or date/time package to represent datetimes.

Input

By default, read.table will read in numeric data, such as 20040115, as numbers, and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
  as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month=1, day=1, year=1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = "year.expand")

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/


orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


Table 1: Comparison Table. Each entry gives the Date, chron and POSIXct idiom in turn.

now
    Date:    Sys.Date()
    chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
    POSIXct: Sys.time()

origin
    Date:    structure(0, class="Date")
    chron:   chron(0)
    POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
    Date:    structure(floor(x), class="Date")
    chron:   chron(x)
    POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))  (GMT)

diff in days
    Date:    dt1-dt2
    chron:   dc1-dc2
    POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
    POSIXct: dp - as.POSIXct(format(dp, tz="GMT"))

compare
    Date:    dt1 > dt2
    chron:   dc1 > dc2
    POSIXct: dp1 > dp2

next day
    Date:    dt+1
    chron:   dc+1
    POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
    Date:    dt-1
    chron:   dc-1
    POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
    Date:    dt0+floor(x)
    chron:   dc0+x
    POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
    Date:    dts <- seq(dt0, length=10, by="day")
    chron:   dcs <- seq(dc0, length=10)
    POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
    Date:    plot(dts, rnorm(10))
    chron:   plot(dcs, rnorm(10))
    POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
    Date:    seq(dt0, length=3, by="2 week")
    chron:   dc0+seq(0, length=3, by=14)
    POSIXct: seq(dp0, length=3, by="2 week")

first day of month
    Date:    as.Date(format(dt, "%Y-%m-01"))
    chron:   chron(dc)-month.day.year(dc)$day+1
    POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:    as.numeric(format(dt, "%m"))
    chron:   months(dc)
    POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
    Date:    as.numeric(format(dt, "%w"))
    chron:   as.numeric(dates(dc)-3) %% 7
    POSIXct: as.numeric(format(dp, "%w"))

day of year
    Date:    as.numeric(format(dt, "%j"))
    chron:   as.numeric(format(as.Date(dc), "%j"))
    POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
    Date:    tapply(x, format(dt, "%Y-%m"), mean)
    chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:    tapply(x, format(dt, "%m"), mean)
    chron:   tapply(x, months(dc), mean)
    POSIXct: tapply(x, format(dp, "%m"), mean)

Output
  yyyy-mm-dd
    Date:    format(dt)
    chron:   format(as.Date(dates(dc)))
    POSIXct: format(dp, "%Y-%m-%d")
  mm/dd/yy
    Date:    format(dt, "%m/%d/%y")
    chron:   format(dates(dc))
    POSIXct: format(dp, "%m/%d/%y")
  Sat Jul 1, 1970
    Date:    format(dt, "%a %b %d %Y")
    chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct: format(dp, "%a %b %d %Y")

Input
  "1970-10-15"
    Date:    as.Date(z)
    chron:   chron(z, format="y-m-d")
    POSIXct: as.POSIXct(z)
  "10/15/1970"
    Date:    as.Date(z, "%m/%d/%Y")
    chron:   chron(z)
    POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(format(dp))
  to chron
    from Date:    chron(unclass(dt))
    from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct
    from Date:    as.POSIXct(format(dt))
    from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(dp)
  to chron
    from Date:    chron(unclass(dt))
    from POSIXct: as.chron(dp)
  to POSIXct
    from Date:    as.POSIXct(format(dt), tz="GMT")
    from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) and controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
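A quick numerical check of that monotonicity claim, on simulated data invented for this purpose:

```r
set.seed(1)
T <- rnorm(100)                     # test results
D <- rbinom(100, 1, plogis(2 * T))  # disease more likely when T is high

DD <- table(-T, D)
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])

# each increment is a non-negative count over a positive total,
# so neither coordinate can ever decrease
stopifnot(all(diff(sens) >= 0), all(diff(mspec) >= 0))
```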

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
    call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
    xlab="1-Specificity", ylab="Sensitivity",
    main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
    xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function to compute the area under the ROC curve is left as an exercise for the reader.
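One possible solution to the exercise, using the trapezoidal rule, might look like this (my sketch, not code from the article; it assumes the sens and mspec components produced by ROC above, which always end at the point (1, 1)):

```r
# A sketch of the exercise: trapezoidal-rule area under the ROC curve.
# Assumes the sens/mspec components created by ROC() above.
AUC <- function(x) {
    sens  <- c(0, x$sens)    # anchor the curve at (0, 0)
    mspec <- c(0, x$mspec)
    # sum of trapezoid areas: width times average height
    sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}
```

For a test with perfect separation this gives 1; for a useless test, about 0.5.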

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    rval <- list(sens=sens, mspec=mspec,
                 test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    rval <- list(sens=sens, mspec=mspec,
                 test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.

setClass("ROC",
    representation(sens="numeric", mspec="numeric",
                   test="numeric", call="call"),
    validity=function(object) {
        length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)
    })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
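The check can be exercised directly (a sketch; the class definition is repeated from above so the snippet stands alone, and the call slot is filled with a made-up call):

```r
library(methods)
# Class definition repeated from above so this snippet is self-contained
setClass("ROC",
    representation(sens="numeric", mspec="numeric",
                   test="numeric", call="call"),
    validity=function(object) {
        length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)
    })

ok <- new("ROC", sens = c(0.5, 1), mspec = c(0, 1),
          test = c(2, 1), call = quote(ROC(T, D)))   # passes validity
bad <- try(new("ROC", sens = c(0.5, 1), mspec = 0,
               test = c(2, 1), call = quote(ROC(T, D))), silent = TRUE)
inherits(bad, "try-error")   # TRUE: mismatched lengths are rejected
```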

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new


function. This checks that the correct components are present, and runs any validity checks. Here is the translation of the ROC function.

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code.

setMethod("show", "ROC",
    function(object) {
        cat("ROC curve: ")
        print(object@call)
    })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
    function(x, y, type="b", null.line=TRUE,
             xlab="1-Specificity",
             ylab="Sensitivity",
             main=NULL, ...) {
        par(pty="s")
        plot(x@mspec, x@sens, type=type,
             xlab=xlab, ylab=ylab, ...)
        if(null.line)
            abline(0, 1, lty=3)
        if(is.null(main))
            main <- x@call
        title(main=main)
    })

Creating a method for lines appears to work the same way.

setMethod("lines", "ROC",
    function(x, ...)
        lines(x@mspec, x@sens, ...))

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
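The promotion of lines to an S4 generic can also be checked with isGeneric (a sketch; the toyROC class is a hypothetical stand-in, used so the snippet does not depend on the ROC definitions above):

```r
library(methods)
library(graphics)
setClass("toyROC", representation(x = "numeric"))   # hypothetical stand-in class
isGeneric("lines")   # FALSE in a fresh session: lines is only an S3 generic
setMethod("lines", "toyROC", function(x, ...) invisible(x@x))
isGeneric("lines")   # TRUE: setMethod has created an S4 generic for lines
```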

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T, D) {
        if (!(T$family$family %in%
              c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")


        test <- fitted(T)
        disease <- (test + resid(T, type="response"))
        disease <- disease * weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[,2])/sum(DD[,2])
        mspec <- cumsum(DD[,1])/sum(DD[,1])
        new("ROC", sens=sens, mspec=mspec,
            test=TT, call=sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
    function(x, labels=NULL, ..., digits=1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
                 labels=labels, ...)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has


been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument, package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and for S-PLUS compatibility lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

    current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

    See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


  - Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list, and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  - Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  - Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  - Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

      "on"      clip to the viewport (was TRUE)
      "inherit" clip to whatever parent says (was FALSE)
      "off"     turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination


follow-up designs By Ernesto Barrios based onDaniel Meyerrsquos code

DCluster A set of functions for the detection of spa-tial clusters of disease using count data Boot-strap is used to estimate sampling distributionsof statistics By Virgilio Goacutemez-Rubio JuanFerraacutendiz Antonio Loacutepez

GeneTS A package for analyzing multiple gene ex-pression time series data Currently GeneTSimplements methods for cell cycle analy-sis and for inferring large sparse graphi-cal Gaussian models For plotting the in-ferred genetic networks GeneTS requires thegraph and Rgraphviz packages (availablefrom wwwbioconductororg) By Konstanti-nos Fokianos Juliane Schaefer and KorbinianStrimmer

HighProbability Provides a simple fast reliable so-lution to the multiple testing problem Given avector of p-values or achieved significance lev-els computed using standard frequentist infer-ence HighProbability determines which onesare low enough that their alternative hypothe-ses can be considered highly probable The p-value vector may be determined using exist-ing R functions such as ttest wilcoxtestcortest or sample HighProbability can beused to detect differential gene expression andto solve other problems involving a large num-ber of hypothesis tests By David R Bickel

Icens Many functions for computing the NPMLE forcensored and truncated data By R Gentlemanand Alain Vandal

MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 04 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn

MNP A publicly available R package that fitsthe Bayesian Multinomial Probit models viaMarkov chain Monte Carlo Along with thestandard Multinomial Probit model it can alsofit models with different choice sets for each ob-servation and complete or partial ordering ofall the available alternatives The estimation isbased on the efficient marginal data augmen-tation algorithm that is developed by Imai andvan Dyk (2004) By Kosuke Imai Jordan VanceDavid A van Dyk

NADA Contains methods described by Dennis RHelsel in his book ldquoNondetects And Data Anal-ysis Statistics for Censored EnvironmentalDatardquo By Lopaka Lee

R2WinBUGS Using this package it is possible tocall a BUGS model summarize inferences andconvergence in a table and graph and savethe simulations in arrays for easy access inR By originally written by Andrew Gelmanchanges and packaged by Sibylle Sturtz andUwe Ligges

RScaLAPACK An interface to ScaLAPACK func-tions from R By Nagiza F Samatova SrikanthYoginath and David Bauer

RUnit R functions implementing a standard UnitTesting framework with additional code in-spection and report generation tools ByMatthias Burger Klaus Juenemann ThomasKoenig

SIN This package provides routines to perform SINmodel selection as described in Drton amp Perl-man (2004) The selected models are in theformat of the ggm package which allows inparticular parameter estimation in the selectedmodel By Mathias Drton

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli

crossdes Contains functions for the construction and randomization of balanced carryover designs, and functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely

evdbayes Provides functions for the Bayesian analysis of extreme value models using MCMC methods. By Alec Stephenson

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil; R port by Alec Stephenson

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu

R News ISSN 1609-3631

Vol. 4/1, June 2004

kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley; some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and a backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugsR', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data frames. By Jens Oehlschlägel

regress Functions to fit a Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state and steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/


• Editorial
• The Decision To Use R
  • Preface
  • Some Background on the Company
  • Evaluating The Tools of the Trade - A Value Based Equation
  • An Introduction To R
  • A Continued Evolution
  • Giving Back
  • Thanks
• The ade4 package - I: One-table methods
  • Introduction
  • The duality diagram class
  • Distance matrices
  • Taking into account groups of individuals
  • Permutation tests
  • Conclusion
• qcc: An R package for quality control charting and statistical process control
  • Introduction
  • Creating a qcc object
  • Shewhart control charts
  • Operating characteristic function
  • Cusum charts
  • EWMA charts
  • Process capability analysis
  • Pareto charts and cause-and-effect diagrams
  • Tuning and extending the package
  • Summary
  • References
• Least Squares Calculations in R
• Tools for interactively exploring R packages
  • References
• The survival package
  • Overview
  • Specifying survival data
  • One and two sample summaries
  • Proportional hazards models
  • Additional features
• useR! 2004
• R Help Desk
  • Introduction
  • Other Applications
  • Time Series
  • Input
  • Avoiding Errors
  • Comparison Table
• Programmers' Niche
  • ROC curves
  • Classes and methods
    • Generalising the class
  • S4 classes and methods
  • Discussion
• Changes in R
  • New features in 1.9.1
  • User-visible changes in 1.9.0
  • New features in 1.9.0
• Changes on CRAN
  • New contributed packages
  • Other changes
• R Foundation News
  • Donations
  • New supporting institutions
  • New supporting members

The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:

> cox.zph(mayomodel)
                  rho chisq      p
edtrt         -0.1602 3.411 0.0648
log(bili)      0.1507 2.696 0.1006
log(protime)  -0.1646 2.710 0.0997
age           -0.0708 0.542 0.4617
platelet      -0.0435 0.243 0.6221
GLOBAL             NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])

[Figure omitted: scaled Schoenfeld residuals with a smoothed estimate of Beta(t) for edtrt, plotted against time.]

Figure 3: Testing proportional hazards

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.

Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.
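As a hedged sketch of how such terms appear in a formula (the data set `mydata` and its variables are hypothetical, not from the article):

```r
library(survival)

## Hypothetical multi-event data: one row per at-risk interval, with an
## id shared by records from the same subject and a grouping variable
## (center) that gets its own baseline hazard.
fit <- coxph(Surv(start, stop, status) ~ treat + cluster(id) + strata(center),
             data = mydata)
```

The cluster() term adjusts the standard errors for within-subject correlation, while strata() fits a separate baseline hazard per center.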

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
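A minimal sketch using the ovarian data set shipped with the survival package (the choice of covariate and distribution here is illustrative, not taken from the article):

```r
library(survival)

## Weibull accelerated-failure-time model for log survival time
fit <- survreg(Surv(futime, fustat) ~ age, data = ovarian, dist = "weibull")
summary(fit)
```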

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and for computing person-years of followup.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

useR! 2004
The R User Conference

by John Fox, McMaster University, Canada ([email protected])

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice


• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of


the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
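To make the three systems concrete, here is a minimal sketch creating the same day (or moment) in each class; the particular date chosen is arbitrary:

```r
library(chron)  # contributed package; install from CRAN first

d  <- as.Date("2004-06-15")              # Date: days only
dc <- chron("06/15/04", "10:30:00")      # chron: days plus fraction of a day
dp <- as.POSIXct("2004-06-15 10:30:00")  # POSIXct: seconds, with time zone

unclass(d)   # days since 1970-01-01
unclass(dp)  # seconds since 1970-01-01 GMT (numeric, possibly with tzone)
```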

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article, which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It is possible to set Excel to use either origin, which is why the word "usually" was employed above.
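For instance (the serial numbers below are made-up values, not taken from the article):

```r
library(chron)

x <- c(38128.25, 38129.75)        # hypothetical Excel (Windows) serial values

as.Date("1899-12-30") + floor(x)  # Date class dates
chron("12/30/1899") + x           # chron dates
```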

SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and date/times is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent date/times; zoo can use any date or time/date package to represent date/times.

Input

By default, read.table will read in numeric data such as 20040115 as numbers, and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In the first case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.[1]

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = "year.expand")

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data.


orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x

Regarding the POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

bull conversion Conversion between different date-time classes and POSIX classes may not use thetime zone expected Convert to character firstto be sure The conversion recipes given in theaccompanying table should work as shown
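The convert-to-character-first recipe can be sketched like this (an illustrative example of my own, not taken from the article's table): going through format() makes the conversion use the displayed date, whereas a direct as.Date() on a POSIXct treats the time as GMT and can shift the day.

```r
# Convert POSIXct to Date two ways; near midnight the results differ.
dp <- as.POSIXct("2004-06-01 23:30:00")   # local time
as.Date(dp)          # direct conversion: uses the GMT day boundary
as.Date(format(dp))  # via character: uses the displayed (local) date
```

For time zones west of Greenwich the direct conversion above lands on the next day, which is exactly the kind of one-day surprise the convert-via-character idiom avoids.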

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de

R News ISSN 1609-3631

Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1,dp2,unit="day")   [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp,tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0,length=2,by="DSTday")[2]   [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0,length=2,by="-1 DSTday")[2]   [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0,length=2,by=paste(x,"DSTday"))[2]   [4]

sequence
  Date:    dts <- seq(dt0,length=10,by="day")
  chron:   dcs <- seq(dc0,length=10)
  POSIXct: dps <- seq(dp0,length=10,by="DSTday")

plot
  Date:    plot(dts,rnorm(10))
  chron:   plot(dcs,rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps),tz="GMT"),rnorm(10))   [5]

every 2nd week
  Date:    seq(dt0,length=3,by="2 week")
  chron:   dc0+seq(0,length=3,by=14)
  POSIXct: seq(dp0,length=3,by="2 week")

first day of month
  Date:    as.Date(format(dt,"%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp,"%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt,"%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp,"%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt,"%w"))
  chron:   as.numeric(dates(dc)-3)%%7
  POSIXct: as.numeric(format(dp,"%w"))

day of year
  Date:    as.numeric(format(dt,"%j"))
  chron:   as.numeric(format(as.Date(dc),"%j"))
  POSIXct: as.numeric(format(dp,"%j"))

mean each month/year
  Date:    tapply(x,format(dt,"%Y-%m"),mean)
  chron:   tapply(x,format(as.Date(dc),"%Y-%m"),mean)
  POSIXct: tapply(x,format(dp,"%Y-%m"),mean)

12 monthly means
  Date:    tapply(x,format(dt,"%m"),mean)
  chron:   tapply(x,months(dc),mean)
  POSIXct: tapply(x,format(dp,"%m"),mean)

Output yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp,"%Y-%m-%d")

Output mm/dd/yy
  Date:    format(dt,"%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp,"%m/%d/%y")

Output Sat Jul 1, 1970
  Date:    format(dt,"%a %b %d %Y")
  chron:   format(as.Date(dates(dc)),"%a %b %d %Y")
  POSIXct: format(dp,"%a %b %d %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z,"%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z,"%m/%d/%Y"))

Conversion (tz="")
  to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(format(dp))
  to chron:   from Date: chron(unclass(dt)); from POSIXct: chron(format(dp,"%m/%d/%Y"),format(dp,"%H:%M:%S"))   [6]
  to POSIXct: from Date: as.POSIXct(format(dt)); from chron: as.POSIXct(paste(as.Date(dates(dc)),times(dc)%%1))

Conversion (tz="GMT")
  to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(dp)
  to chron:   from Date: chron(unclass(dt)); from POSIXct: as.chron(dp)
  to POSIXct: from Date: as.POSIXct(format(dt),tz="GMT"); from chron: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps,rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp),tz="GMT")).


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that -Inf, +Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
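The monotonicity claim can be checked directly with simulated data (my illustration; the data are invented, not from the article):

```r
# Simulated diagnostic-test data: D is the disease indicator and T is
# a test result that tends to be higher in the diseased group.
set.seed(42)
D <- rbinom(200, 1, 0.3)
T <- rnorm(200, mean = D)

# The same computation drawROC performs internally:
DD    <- table(-T, D)
sens  <- cumsum(DD[,2])/sum(DD[,2])
mspec <- cumsum(DD[,1])/sum(DD[,1])

# Both rates are cumulative sums of non-negative counts,
# so the curve is non-decreasing in each coordinate.
all(diff(sens) >= 0) && all(diff(mspec) >= 0)
```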

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
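One possible solution to that exercise (my sketch, not part of the article) integrates the curve with the trapezoidal rule, assuming an object with the sens and mspec components built by the ROC() constructor above:

```r
# Trapezoidal-rule area under an ROC curve (a sketch of the exercise
# left to the reader; AUC is a hypothetical function name).
AUC <- function(x) {
  # Anchor the curve at (0,0) and (1,1) so the integral spans [0,1].
  mspec <- c(0, x$mspec, 1)
  sens  <- c(0, x$sens, 1)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```

For a perfectly separating test AUC() returns 1, and for a useless test it is about 0.5.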

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster, in the Design package, is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
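To see the validity check in action, here is a hedged illustration (the class name "ROCdemo" is hypothetical, chosen to avoid clashing with the article's "ROC" class): mismatched slot lengths cause object creation to fail.

```r
library(methods)

# A cut-down class whose validity function mirrors the one above.
# Validity functions return TRUE, or a string describing the problem.
setClass("ROCdemo",
  representation(sens="numeric", mspec="numeric"),
  validity=function(object) {
    if (length(object@sens) == length(object@mspec)) TRUE
    else "sens and mspec lengths differ"
  })

ok  <- new("ROCdemo", sens=c(0,1), mspec=c(0,1))            # passes
bad <- try(new("ROCdemo", sens=c(0,1), mspec=0), silent=TRUE) # rejected
inherits(bad, "try-error")
```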

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or better set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
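A small sketch of these environment accessors (my illustration, not from the NEWS entry):

```r
# $ and [[ on environments behave like get/assign with inherits=FALSE.
e <- new.env()
e$answer <- 42           # same as assign("answer", 42, envir = e)
e[["answer"]]            # same as get("answer", envir = e, inherits = FALSE)
e$missingname            # NULL for an absent name, rather than an error
```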

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and for S-PLUS compatibility lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
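A few of the utility additions above can be sketched together (my combined illustration, assuming R >= 1.9.0):

```r
# mget(): retrieve multiple values from an environment at once.
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
vals <- mget(c("a", "b"), envir = e)    # a named list of both values

# strsplit() with fixed=TRUE: the split string is taken literally,
# not as a regular expression (here "." would otherwise match anything).
parts <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# head()/tail() from package 'utils'.
head(1:10, 3)
tail(1:10, 3)
```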


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

    current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

    See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  - Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


ndash Added enginedisplaylist() functionThis allows the user to tell grid NOT touse the graphics engine display list andto handle ALL redraws using its own dis-play list (including redraws after deviceresizes and copies)This provides a way to avoid some of theproblems with resizing a device when youhave used gridconvert() or the grid-Base package or even base functions suchas legend()There is a document discussing the useof display lists in grid on the grid website httpwwwstataucklandacnz~paulgridgridhtml

ndash Changed the implementation of grob ob-jects They are no longer implementedas external references They are nowregular R objects which copy-by-valueThis means that they can be savedloadedlike normal R objects In order to retainsome existing grob behaviour the follow-ing changes were necessary

grobs all now have a name slot Thegrob name is used to uniquely iden-tify a drawn grob (ie a grob on thedisplay list)

gridedit() and gridpack() nowtake a grob name as the firs argumentinstead of a grob (Actually they takea gPath see below)

the grobwidth and grobheightunits take either a grob OR a grobname (actually a gPath see below)Only in the latter case will the unitbe updated if the grob pointed to ismodified

In addition the following features arenow possible with grobs

grobs now save()load() like anynormal R object

many grid() functions now have aGrob() counterpart The grid()version is used for its side-effectof drawing something or modifyingsomething which has been drawn theGrob() version is used for its re-turn value which is a grob Thismakes it more convenient to just workwith grob objects without producingany graphical output (by using theGrob() functions)

there is a gTree object (derived fromgrob) which is a grob that canhave children A gTree also has achildrenvp slot which is a view-port which is pushed and then uped

before the children are drawn this al-lows the children of a gTree to placethemselves somewhere in the view-ports specified in the childrenvp byhaving a vpPath in their vp slot

there is a gPath object which is essen-tially a concatenation of grob namesThis is used to specify the child of (achild of ) a gTree

there is a new API for creat-ingaccessingmodifying grob ob-jects gridadd() gridremove()gridedit() gridget() (and theirGrob() counterparts can be used toadd remove edit or extract a grobor the child of a gTree NOTE thenew gridedit() API is incompati-ble with the previous version

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever the parent says (was FALSE)
  "off" = turn off clipping

  Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• codoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.

R News ISSN 1609-3631

Vol. 4/1, June 2004

• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv = 2) and psigamma(, deriv = 3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model, and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics

Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt

Introduction

R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
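A short sketch of typical Date usage (the particular dates and format string below are invented for illustration):

```r
# ISO yyyy-mm-dd is the default input format
dt <- as.Date("2004-06-15")

# other formats are specified with the percent codes of ?strptime
as.Date("15/06/2004", format = "%d/%m/%Y")

# arithmetic is in whole days
dt + 7

# the internal representation is days since 1970-01-01
as.numeric(as.Date("1970-01-02"))   # 1
```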

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
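For instance, the internal representation mentioned above can be verified directly (a sketch; chron must first be installed from CRAN):

```r
library(chron)

# chron's default formats are m/d/y for dates and h:m:s for times;
# days are counted from 1970-01-01, times are fractions of a day
noon_jan2 <- chron("01/02/1970", "12:00:00")
as.numeric(noon_jan2)   # 1.5
```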

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
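A brief sketch of the two representations (the timestamp below is arbitrary):

```r
# POSIXct: seconds since 1970-01-01 00:00:00 GMT
x <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")
unclass(x)        # a number of seconds (plus a tzone attribute)

# POSIXlt: a list of components (sec, min, hour, mday, ...)
lt <- as.POSIXlt(x)
lt$hour           # 12
```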

When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
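Putting the Windows-origin conversion together (the serial numbers here are made up; chron must be installed from CRAN):

```r
library(chron)

# hypothetical Excel (Windows) serial date numbers
x <- c(38153.0, 38153.5)

# Date: keep only the day part
as.Date("1899-12-30") + floor(x)

# chron: the fractional part carries the time of day
chron("12/30/1899") + x
```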

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
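The SAS convention can be mapped onto POSIXct by shifting origins; the following is a sketch under the assumption that the SAS value (here 0, i.e. the SAS origin itself) ignores leap seconds, as POSIXct does:

```r
# Seconds between the POSIXct origin (1970-01-01 GMT) and the SAS
# origin (1960-01-01), computed rather than hard-coded.
offset <- as.numeric(ISOdatetime(1970, 1, 1, 0, 0, 0, tz = "GMT") -
                     ISOdatetime(1960, 1, 1, 0, 0, 0, tz = "GMT"),
                     units = "secs")
sas.sec <- 0   # hypothetical SAS datetime value: seconds since 1960-01-01
p <- structure(sas.sec - offset, class = c("POSIXt", "POSIXct"))
format(p, tz = "GMT")   # "1960-01-01"
```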

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
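For instance, a regular monthly series in the ts class (the data here are arbitrary) is labelled by year and period rather than by dates:

```r
# A hypothetical monthly series starting January 2004.
z <- ts(rnorm(12), start = c(2004, 1), frequency = 12)
time(z)[2]   # 2004 + 1/12, i.e. February 2004
```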

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.


orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60

# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames – POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
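As an illustration of the character-first advice in the last two items (the datetime value is hypothetical):

```r
# A POSIXct value with an explicit time zone.
dp <- as.POSIXct("2004-07-01 12:00:00", tz = "GMT")
# Go to POSIXlt via character, rather than converting directly,
# so that components such as hour and isdst are set as expected.
dl <- as.POSIXlt(format(dp, tz = "GMT"), tz = "GMT")
c(dl$hour, dl$isdst)   # 12 0
```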

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
[email protected]

Thomas Petzoldt
[email protected]


Table 1: Comparison Table

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output ("yyyy-mm-dd")
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output ("mm/dd/yy")
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output ("Sat Jul 1 1970")
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input ("1970-10-15")
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input ("10/15/1970")
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    from chron: as.Date(dates(dc));  from POSIXct: as.Date(format(dp))
  to chron:   from Date: chron(unclass(dt));  from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: from Date: as.POSIXct(format(dt));  from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    from chron: as.Date(dates(dc));  from POSIXct: as.Date(dp)
  to chron:   from Date: chron(unclass(dt));  from POSIXct: as.chron(dp)
  to POSIXct: from Date: as.POSIXct(format(dt), tz = "GMT");  from chron: as.POSIXct(dc)

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1−specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
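A tiny numeric check of that point (the data are hypothetical) confirms that both rates are nondecreasing by construction:

```r
T <- c(1, 2, 2, 3, 4)   # test results
D <- c(0, 0, 1, 1, 1)   # disease indicator
DD <- table(-T, D)
# Cumulative sums of non-negative counts are nondecreasing.
sens  <- cumsum(DD[, 2])/sum(DD[, 2])
mspec <- cumsum(DD[, 1])/sum(DD[, 1])
all(diff(sens) >= 0) && all(diff(mspec) >= 0)   # TRUE
```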

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function to compute the area under the ROC curve is left as an exercise for the reader.

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
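To see the validity mechanism in action without depending on the ROC class, here is a minimal self-contained sketch (the class name "XY" is made up for illustration):

```r
library(methods)
# A toy class with two numeric slots and a validity check that
# returns TRUE or a character message, as setClass expects.
setClass("XY", representation(x = "numeric", y = "numeric"),
         validity = function(object)
           if (length(object@x) == length(object@y)) TRUE
           else "x and y lengths differ")
ok  <- new("XY", x = 1:3, y = 4:6)            # passes the check
bad <- try(new("XY", x = 1:3, y = 1:2),       # fails the check
           silent = TRUE)
inherits(bad, "try-error")   # TRUE
```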

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The notation @ should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
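A quick sketch of those accessor semantics (the environment and binding names here are hypothetical):

```r
e <- new.env()
e$alpha <- 1        # like assign("alpha", 1, envir = e, inherits = FALSE)
e[["alpha"]]        # 1
# The binding lives in e itself, not in a parent environment.
exists("alpha", envir = e, inherits = FALSE)   # TRUE
```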

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call; the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
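For instance (the names a and b are hypothetical), mget() returns a named list of several bindings at once:

```r
e <- new.env()
e$a <- 1
e$b <- 2
# Retrieve both bindings in one call.
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```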

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

bull Changes to package rsquogridrsquo

ndash Renamed pushpopviewport() topushpopViewport()

ndash Added upViewport() downViewport()and seekViewport() to allow creationand navigation of viewport tree (ratherthan just viewport stack)

ndash Added id and idlengths arguments togridpolygon() to allow multiple poly-gons within single gridpolygon() call

ndash Added vpList() vpStack() vpTree()and currentvpTree() to allow creationof viewport bundles that may be pushedat once (lists are pushed in parallel stacksin series)

currentvpTree() returns the currentviewport tree

ndash Added vpPath() to allow specificationof viewport path in downViewport() andseekViewport()

See viewports for an example of its use

NOTE it is also possible to specify a pathdirectly eg something like vp1vp2but this is only advised for interactive use(in case I decide to change the separator in later versions)

ndash Added just argument to gridlayout()to allow justification of layout relative toparent viewport IF the layout is not thesame size as the viewport Therersquos an ex-ample in help(gridlayout)

ndash Allowed the vp slot in a grob to bea viewport name or a vpPath Theinterpretation of these new alterna-tives is to call downViewport() withthe name or vpPath before drawingthe grob and upViewport() the appro-priate amount after drawing the grobHerersquos an example of the possible usagepushViewport(viewport(w=5 h=5

name=A))gridrect()pushViewport(viewport(w=5 h=5

name=B))gridrect(gp=gpar(col=grey))upViewport(2)gridrect(vp=A

gp=gpar(fill=red))gridrect(vp=vpPath(A B)

gp=gpar(fill=blue))


  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport that is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit() and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

      "on" = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off" = turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
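For instance (a small sketch; the values shown assume the standard 1970-01-01 origin):

```r
## A Date is stored internally as days since 1970-01-01:
unclass(as.Date("2004-06-01"))                  # 12570
## A POSIXct datetime is seconds since 1970-01-01 00:00:00 UTC:
unclass(as.POSIXct("2004-06-01", tz = "GMT"))   # 1086048000 (plus a tzone attribute)
```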

When considering which class to use alwayschoose the least complex class that will support theapplication That is use Date if possible otherwise usechron and otherwise use the POSIX classes Such a strat-egy will greatly reduce the potential for error and in-crease the reliability of your application

A full page table is provided at the end of this ar-ticle which illustrates and compares idioms writtenin each of the three datetime systems above

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.

Other Applications

Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It is possible to set Excel to use either origin, which is why the word "usually" was employed above.
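The conversion above can be checked directly. A minimal sketch; the serial numbers below are made-up values, and the commented chron line assumes the chron package is installed:

```r
# Hypothetical Excel-for-Windows serial datetimes (days + fraction
# of days since 1899-12-30)
x <- c(38168.25, 38169.5)

# Date keeps whole days only, relative to Excel's usual Windows origin
d <- as.Date("1899-12-30") + floor(x)

# chron would also keep the time-of-day part:
# library(chron); chron("12/30/1899") + x
d
```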

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).

Time Series

A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
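For the regular case, a minimal sketch using only base R's ts class (no contributed packages assumed):

```r
# A monthly regular series: ts needs only a start and a frequency,
# not a date class
z <- ts(rnorm(24), start = c(2000, 1), frequency = 12)

# Observations are labelled numerically (2000.000, 2000.083, ...)
# rather than with Date or POSIXct values
head(time(z))
```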

Input

By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.[1]

# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)

Avoiding Errors

The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,

[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.

R News ISSN 1609-3631


orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60
# chron dates with default origin
orig + x

Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in Table 1 of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
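The convert-to-character-first recipe from the last two points can be sketched as follows; the timestamp is arbitrary:

```r
# A POSIXct datetime fixed in GMT
dp <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")

# Going through character preserves the printed clock time,
# regardless of the local time zone or DST status
dl <- as.POSIXlt(format(dp, tz = "GMT"), tz = "GMT")
dl$hour   # 12, as displayed
```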

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de


now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days [1]
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day")

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day [2]
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2]

previous day [3]
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2]

x days since date [4]
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot [5]
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10))

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3)%%7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output ("yyyy-mm-dd")
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output ("mm/dd/yy")
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output ("Sat Jul 1 1970")
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input ("1970-10-15")
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input ("10/15/1970")
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(format(dp))
  to chron:   from Date: chron(unclass(dt)); from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: from Date: as.POSIXct(format(dt)); from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc)%%1))

Conversion (tz="GMT")
  to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(dp)
  to chron:   from Date: chron(unclass(dt)); from POSIXct: as.chron(dp)
  to POSIXct: from Date: as.POSIXct(format(dt), tz="GMT"); from chron: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison table.



Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that plus and minus infinity and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1 - D)[T <= c])/sum(1 - D))
  plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
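That monotonicity is easy to confirm on toy data; the vectors below are made up for illustration:

```r
# A test value T and disease indicator D for six subjects
T <- c(1, 2, 3, 4, 5, 6)
D <- c(0, 0, 1, 0, 1, 1)

DD <- table(-T, D)
sens  <- cumsum(DD[, 2])/sum(DD[, 2])
mspec <- cumsum(DD[, 1])/sum(DD[, 1])

# Cumulative sums of non-negative counts never decrease
all(diff(sens) >= 0) && all(diff(mspec) >= 0)   # TRUE
```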

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
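That "duck typing" can be demonstrated in a few lines; the quack generic below is invented for illustration:

```r
# S3 dispatch is purely name-based: the method is found by pasting
# the generic's name and the object's class
quack <- function(x, ...) UseMethod("quack")
quack.default <- function(x, ...) stop("not a duck")
quack.duck <- function(x, ...) "Quack!"

# An empty list becomes a "duck" just by setting the class attribute
d <- structure(list(), class = "duck")
quack(d)   # "Quack!" -- R simply trusts the class attribute
```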

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
)

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines: calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


bull Model fits now add a dataClasses attributeto the terms which can be used to check thatthe variables supplied for prediction are of thesame type as those used for fitting (It is cur-rently used by predict() methods for classeslm mlm glm and ppr as well asmethods in packages MASS rpart and tree)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
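For example (variable names invented), the coercion turns an environment's bindings into an ordinary named list:

```r
e <- new.env()
e$alpha <- 1
e$beta  <- "two"
lst <- as.list(e)    # named list with the environment's bindings
sort(names(lst))     # "alpha" "beta"
```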

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
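A minimal sketch with a made-up 2x2 table (the default summary function is sum, so a "Sum" row and column are appended):

```r
tab <- as.table(matrix(1:4, nrow = 2,
                       dimnames = list(g = c("a", "b"), h = c("x", "y"))))
m <- addmargins(tab)   # adds a "Sum" row and column by default
m
```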

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like "vp1::vp2", but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added 'just' argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport that is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit() and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport 'clip' argument are now:

    "on"      = clip to the viewport (was TRUE)
    "inherit" = clip to whatever parent says (was FALSE)
    "off"     = turn off clipping

Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
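The replacement is a single function indexed by derivative order; a quick check against the older special-purpose functions (point x chosen arbitrarily):

```r
x <- 2.5
# deriv = 0 and deriv = 1 reproduce digamma() and trigamma()
all.equal(psigamma(x, deriv = 0), digamma(x))    # TRUE
all.equal(psigamma(x, deriv = 1), trigamma(x))   # TRUE
# deriv = 2 and deriv = 3 take over from tetragamma() and pentagamma()
psigamma(x, deriv = 2)
```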

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit a Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A Simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

  • Editorial
  • The Decision To Use R
    • Preface
    • Some Background on the Company
    • Evaluating The Tools of the Trade - A Value Based Equation
    • An Introduction To R
    • A Continued Evolution
    • Giving Back
    • Thanks
      • The ade4 package - I One-table methods
        • Introduction
        • The duality diagram class
        • Distance matrices
        • Taking into account groups of individuals
        • Permutation tests
        • Conclusion
          • qcc An R package for quality control charting and statistical process control
            • Introduction
            • Creating a qcc object
            • Shewhart control charts
            • Operating characteristic function
            • Cusum charts
            • EWMA charts
            • Process capability analysis
            • Pareto charts and cause-and-effect diagrams
            • Tuning and extending the package
            • Summary
            • References
              • Least Squares Calculations in R
              • Tools for interactively exploring R packages
                • References
                  • The survival package
                    • Overview
                    • Specifying survival data
                    • One and two sample summaries
                    • Proportional hazards models
                    • Additional features
                      • useR 2004
                      • R Help Desk
                        • Introduction
                        • Other Applications
                        • Time Series
                        • Input
                        • Avoiding Errors
                        • Comparison Table
                          • Programmers Niche
                            • ROC curves
                            • Classes and methods
                              • Generalising the class
                                • S4 classes and method
                                • Discussion
                                  • Changes in R
                                    • New features in 191
                                    • User-visible changes in 190
                                    • New features in 190
                                      • Changes on CRAN
                                        • New contributed packages
                                        • Other changes
                                          • R Foundation News
                                            • Donations
                                            • New supporting institutions
                                            • New supporting members
R News, Vol. 4/1, June 2004, page 31

orig <- chron("01/01/60")
x <- 0:9 # days since 01/01/60
# chron dates with default origin
orig + x

Regarding the POSIX classes, the user should be aware of the following:

• Time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
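A minimal sketch of the workaround described above (variable names are illustrative): convert the datetimes to character in the local time zone and re-read them as GMT, so the plotted positions agree with the wall clock.

```r
# dp: datetimes in the local time zone
dp <- seq(Sys.time(), len = 10, by = "day")

# re-read the local wall-clock representation as GMT, so that
# plot(), which treats datetimes as relative to GMT, shows
# positions that match the local wall-clock times
dpGMT <- as.POSIXct(format(dp), tz = "GMT")
plot(dpGMT, 1:10)
```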

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone and Greenwich Mean Time (GMT) can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
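A small illustration of the character-first idiom just described (the datetime is chosen arbitrarily):

```r
# a POSIXct datetime in the local time zone
dp <- as.POSIXct("2004-06-01 12:00:00")

# direct coercion may carry over an unexpected tzone/isdst;
# converting via character first is the safer route
dlt <- as.POSIXlt(format(dp))
dlt$hour   # 12, as read from the character representation
```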

• Conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.

Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.

Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de

R News ISSN 1609-3631

Table 1: Comparison table. Each entry gives the idiom for the Date, chron and POSIXct classes in turn; bracketed numbers refer to the notes below the table.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0 + seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct: format(dp, "%a %b %d, %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc)); as.Date(format(dp))
  to chron:   chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc)); as.Date(dp)
  to chron:   chron(unclass(dt)); as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT"); as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).


Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 − specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined:

drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
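A tiny worked example (data invented for illustration) shows the mechanics: each column of the table is accumulated, so both rates are necessarily non-decreasing.

```r
T <- c(1, 1, 2, 3)        # test results
D <- c(0, 1, 0, 1)        # disease indicator
DD <- table(-T, D)        # rows ordered by decreasing T
sens  <- cumsum(DD[, 2])/sum(DD[, 2])   # true positive rate
mspec <- cumsum(DD[, 1])/sum(DD[, 1])   # false positive rate
stopifnot(!is.unsorted(sens), !is.unsorted(mspec))
```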

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
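A toy illustration of this convention (the duck class and its methods are of course made up for the example):

```r
# an S3 generic and a method for class "duck"
quack <- function(x, ...) UseMethod("quack")
quack.duck <- function(x, ...) "quack!"

# any object can be given the class; R trusts the label
d <- structure(list(), class = "duck")
quack(d)   # dispatches to quack.duck
```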

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function to compute the area under the ROC curve is left as an exercise for the reader.
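One possible solution, offered as a sketch rather than the article's own code: apply the trapezoidal rule to the stored (mspec, sens) points, prepending the (0, 0) corner that the stored curve omits.

```r
# AUC via the trapezoidal rule; works on any object with
# $sens and $mspec components, such as the S3 "ROC" objects above
AUC <- function(x) {
  sens  <- c(0, x$sens)    # prepend the (0, 0) corner
  mspec <- c(0, x$mspec)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```

For a perfectly separating test this gives 1, and in general it agrees with the Mann-Whitney (concordance) interpretation of the AUC.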

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens)==length(object@mspec) &&
    length(object@sens)==length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
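For instance, the validity function could be tightened along these lines (an illustrative variant, not the article's code; it replaces the validity argument shown above):

```r
library(methods)
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    n <- length(object@sens)
    if (n != length(object@mspec) || n != length(object@test))
      return("sens, mspec and test must have equal lengths")
    # rates are probabilities, so must lie in [0, 1]
    if (any(object@sens < 0) || any(object@sens > 1) ||
        any(object@mspec < 0) || any(object@mspec > 1))
      return("rates must lie in [0, 1]")
    # an ROC curve is monotone, so both vectors must be non-decreasing
    if (is.unsorted(object@sens) || is.unsorted(object@mspec))
      return("sens and mspec must be non-decreasing")
    TRUE
  })
```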

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
)

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

bull dircreate() is now an internal function(rather than a call to mkdir) on Unix as well ason Windows There is now an option to sup-press warnings from mkdir which may or maynot have been wanted

bull dist() has a new method to calculateMinkowski distances

bull expandgrid() returns appropriate array di-mensions and dimnames in the attributeoutattrs and this is used by thepredict() method for loess to return a suit-able array

bull factanal() loess() and princomp() now ex-plicitly check for numerical inputs they mighthave silently coded factor variables in formu-lae

bull New functions factorial() defined asgamma(x+1) and for S-PLUS compatibilitylfactorial() defined as lgamma(x+1)

bull findInterval(x v) now allows +-Inf val-ues and NAs in x

bull formuladefault() now looks for a termscomponent before a formula argument in thesaved call the component will have lsquorsquo ex-panded and probably will have the original en-vironment set as its environment And what itdoes is now documented

bull glm() arguments etastart and mustart arenow evaluated via the model frame in the sameway as subset and weights


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces; see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment() to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g. something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  - Added a just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))

R News ISSN 1609-3631

Vol. 4/1, June 2004

– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

    "on"      clip to the viewport (was TRUE)
    "inherit" clip to whatever parent says (was FALSE)
    "off"     turn off clipping

  Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
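The effect of fixing the seed can be sketched in a few lines (a minimal illustration of why this makes example output reproducible, not the check code itself):

```r
# With the default RNGkind and a fixed seed, the random stream is identical
# from run to run, so example output no longer varies between checks.
RNGkind("default", "default")
set.seed(1)
a <- runif(3)
set.seed(1)
b <- runif(3)
stopifnot(identical(a, b))  # same seed, same numbers
```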

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
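A DESCRIPTION file using the new field might look like this (the package name and field values here are hypothetical):

```
Package: mypkg
Version: 0.1-0
Title: An Example Package
Depends: R (>= 1.9.0)
Suggests: lattice, MASS
Description: lattice and MASS are needed only to run the examples
        and vignettes, not by the package code itself.
```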


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
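The replacement is direct: psigamma(x, deriv = n) is the (n+1)-st derivative of the log gamma function, so deriv = 0 and deriv = 1 reproduce digamma() and trigamma(), just as deriv = 2 and deriv = 3 replace the deprecated functions. A quick check of the lower-order cases in base R:

```r
# psigamma() subsumes the whole family of polygamma functions.
x <- 2.5
stopifnot(all.equal(psigamma(x, deriv = 0), digamma(x)),
          all.equal(psigamma(x, deriv = 1), trigamma(x)))
```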

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas-goto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NNs. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fast.lts and fast.mcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the Matérn and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/



Vol 41 June 2004 32Date

chron

POSIXct

now

SysDate()

aschron(asPOSIXct(format(Systime())tz=GMT))

Systime()

origin

structure(0class=Date)

chron(0)

structure(0class=c(POSIXtPOSIXct))

xdays

since

origin

structure(floor(x)class=Date)

chron(x)

structure(x246060class=c(POSIXtPOSIXct))

G

MT

days

diff

in

days

dt1-dt2

dc1-dc2

difftime(dp1dp2unit=day)

1

time

zone

difference

dp-asPOSIXct(format(dptz=GMT))

compare

dt1

gtdt2

dc1

gtdc2

dp1

gtdp2

next

day

dt+1

dc+1

seq(dp0length=2by=DSTday)[2]

2

previous

day

dt-1

dc-1

seq(dp0length=2by=-1

DSTday)[2]

3

xdays

since

date

dt0+floor(x)

dc0+

xseq(dp0length=2by=paste(xDSTday)[2]

4

sequence

dts

lt-

seq(dt0length=10by=day)

dcs

lt-

seq(dc0length=10)

dps

lt-

seq(dp0length=10by=DSTday)

plot

plot(dtsrnorm(10))

plot(dcsrnorm(10))

plot(asPOSIXct(format(dps)tz=GMT)rnorm(10))

5

every

2nd

week

seq(dt0length=3by=2

week)

dc0+seq(0length=3by=14)

seq(dp0length=3by=2

week)

first

day

of

month

asDate(format(dtY-m-01))

chron(dc)-monthdayyear(dc)$day+1

asPOSIXct(format(dpY-m-01))

month

which

sorts

asnumeric(format(dtm))

months(dc)

asn

umer

ic(f

orm

at(d

p

m)

)day

of

week

(Sun=0)

asnumeric(format(dtw))

asnumeric(dates(dc)-3)7

asnumeric(format(dpw))

day

of

year

Day of year:
    Date:    as.numeric(format(dt, "%j"))
    chron:   as.numeric(format(as.Date(dc), "%j"))
    POSIXct: as.numeric(format(dp, "%j"))

Mean each month/year:
    Date:    tapply(x, format(dt, "%Y-%m"), mean)
    chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means:
    Date:    tapply(x, format(dt, "%m"), mean)
    chron:   tapply(x, months(dc), mean)
    POSIXct: tapply(x, format(dp, "%m"), mean)

Output, yyyy-mm-dd:
    Date:    format(dt)
    chron:   format(as.Date(dates(dc)))
    POSIXct: format(dp, "%Y-%m-%d")

Output, mm/dd/yy:
    Date:    format(dt, "%m/%d/%y")
    chron:   format(dates(dc))
    POSIXct: format(dp, "%m/%d/%y")

Output, Sat Jul 1 1970:
    Date:    format(dt, "%a %b %d %Y")
    chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct: format(dp, "%a %b %d %Y")

Input, "1970-10-15":
    Date:    as.Date(z)
    chron:   chron(z, format = "y-m-d")
    POSIXct: as.POSIXct(z)

Input, "10/15/1970":
    Date:    as.Date(z, "%m/%d/%Y")
    chron:   chron(z)
    POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = ""):
    to Date:
        from chron:   as.Date(dates(dc))
        from POSIXct: as.Date(format(dp))
    to chron:
        from Date:    chron(unclass(dt))
        from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
    to POSIXct:
        from Date:    as.POSIXct(format(dt))
        from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT"):
    to Date:
        from chron:   as.Date(dates(dc))
        from POSIXct: as.Date(dp)
    to chron:
        from Date:    chron(unclass(dt))
        from POSIXct: as.chron(dp)
    to POSIXct:
        from Date:    as.POSIXct(format(dt), tz = "GMT")
        from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dp, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison Table
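The Input and Output rows of the table can be exercised directly; the following round-trip (my illustration, not from the article; chron is omitted so no add-on package is needed) converts a character date into two of the three classes and formats it back:

```r
## Round-trip a character date through the Date and POSIXct classes.
z  <- "1970-10-15"
dt <- as.Date(z)                 # Date input row
dp <- as.POSIXct(z, tz = "GMT")  # POSIXct input row
stopifnot(format(dt) == z,                    # yyyy-mm-dd output row
          format(dp, "%Y-%m-%d") == z,
          format(dt, "%m/%d/%y") == "10/15/70")  # mm/dd/yy output row
```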

R News ISSN 1609-3631

Vol. 4/1, June 2004 33

Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that -Inf, +Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
    cutpoints <- c(-Inf, sort(unique(T)), Inf)
    sens <- sapply(cutpoints,
        function(c) sum(D[T > c])/sum(D))
    spec <- sapply(cutpoints,
        function(c) sum((1 - D)[T <= c])/sum(1 - D))
    plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
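The monotonicity claim is easy to check numerically. This snippet (my illustration with simulated data, not code from the article) recomputes sens and mspec as in drawROC and verifies that both are non-decreasing:

```r
## Simulated test results: higher on average in the diseased group.
set.seed(1)
D <- rep(c(0, 1), each = 50)       # disease indicator
T <- rnorm(100, mean = 1.5 * D)    # test result
DD <- table(-T, D)                 # rows ordered by decreasing cutpoint
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])   # true positive rate
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])   # false positive rate
stopifnot(all(diff(sens) >= 0), all(diff(mspec) >= 0))
```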

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}

print.ROC <- function(x, ...) {
    cat("ROC curve: ")
    print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
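The dispatch mechanism itself can be seen with a toy class (the class name and method below are mine, purely for illustration):

```r
## S3 dispatch: print() looks for print.<class>, here print.duck.
duck <- structure(list(feet = 2), class = "duck")
print.duck <- function(x, ...) cat("quack\n")
print(duck)    # dispatches to print.duck and prints "quack"
```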

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
    par(pty = "s")
    plot(x$mspec, x$sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
        abline(0, 1, lty = 3)
    if (is.null(main))
        main <- x$call
    title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels = labels)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
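One possible solution to that exercise, using the trapezoidal rule on the stored (mspec, sens) points, might look like this (my sketch, not code from the article; it assumes the list components created by the ROC constructor above):

```r
## Area under the curve by the trapezoidal rule. The constructor's
## sens and mspec both end at 1, so only the (0, 0) corner is added.
AUC <- function(x) {
    mspec <- c(0, x$mspec)
    sens  <- c(0, x$sens)
    sum(diff(mspec) * (sens[-1] + sens[-length(sens)]) / 2)
}

## A perfectly separating test has AUC 1 (hand-built example object):
r <- list(sens = c(0.5, 1, 1, 1), mspec = c(0, 0, 0.5, 1))
AUC(r)   # 1
```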

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
        test = "numeric", call = "call"),
    validity = function(object) {
        length(object@sens) == length(object@mspec) &&
            length(object@sens) == length(object@test)
    })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
    function(object) {
        cat("ROC curve: ")
        print(object@call)
    })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
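Slot access with @ can be tried on a minimal class (the class name "point" is mine, for illustration only):

```r
library(methods)  # S4 support

setClass("point", representation(x = "numeric", y = "numeric"))
p <- new("point", x = 1, y = 2)
p@x            # 1, analogous to $ for a list component
slot(p, "y")   # 2, the functional form of the same access
```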

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
    function(x, y, type = "b", null.line = TRUE,
             xlab = "1-Specificity",
             ylab = "Sensitivity",
             main = NULL, ...) {
        par(pty = "s")
        plot(x@mspec, x@sens, type = type,
             xlab = xlab, ylab = ylab, ...)
        if (null.line)
            abline(0, 1, lty = 3)
        if (is.null(main))
            main <- x@call
        title(main = main)
    })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
    function(x, ...) {
        lines(x@mspec, x@sens, ...)
    })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T, ...) {
        if (!(T$family$family %in%
              c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")
        test <- fitted(T)
        disease <- (test + resid(T, type = "response"))
        disease <- disease * weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[, 2])/sum(DD[, 2])
        mspec <- cumsum(DD[, 1])/sum(DD[, 1])
        new("ROC", sens = sens, mspec = mspec,
            test = TT, call = sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
    function(x, labels = NULL, ..., digits = 1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
            labels = labels)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  – Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list, and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

      "on"      clip to the viewport (was TRUE)
      "inherit" clip to whatever parent says (was FALSE)
      "off"     turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs Cal-culates exact and approximate theory experi-mental designs for D A and I criteria Verylarge designs may be created Experimental de-

signs may be blocked or blocked designs cre-ated from a candidate list using several crite-ria The blocking can be done when whole andwithin plot factors interact By Bob Wheeler

BradleyTerry Specify and fit the Bradley-Terrymodel and structured versions By David Firth

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

R News ISSN 1609-3631
Vol. 4/1, June 2004

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the Matérn and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding, as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns, ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley

ROC curves

Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity) Pr(T > c | D = 1) against the false positive rate (1 − specificity) Pr(T > c | D = 0) for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.

drawROC <- function(T, D) {
    cutpoints <- c(-Inf, sort(unique(T)), Inf)
    sens <- sapply(cutpoints,
                   function(c) sum(D[T > c]) / sum(D))
    spec <- sapply(cutpoints,
                   function(c) sum((1 - D)[T <= c]) / sum(1 - D))
    plot(1 - spec, sens, type = "l")
}

It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.

drawROC <- function(T, D) {
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    plot(mspec, sens, type = "l")
}

One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
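The monotonicity claim can be checked directly. The following sketch (the simulated data and seed are my own, not from the article) recomputes the table-based rates without plotting:

```r
# Hypothetical data: biomarker T shifted upward in the diseased group (D == 1)
set.seed(42)
D <- rep(c(0, 1), each = 50)
T <- rnorm(100, mean = D)

DD <- table(-T, D)                     # as in the optimised drawROC
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])

# Cumulative sums of non-negative counts are monotone and end at exactly 1
stopifnot(all(diff(sens) >= 0), all(diff(mspec) >= 0))
stopifnot(sens[length(sens)] == 1, mspec[length(mspec)] == 1)
```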

Classes and methods

Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec, test = TT,
                 call = sys.call())
    class(rval) <- "ROC"
    rval
}

print.ROC <- function(x, ...) {
    cat("ROC curve: ")
    print(x$call)
}

The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.

plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
    par(pty = "s")
    plot(x$mspec, x$sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
        abline(0, 1, lty = 3)
    if (is.null(main))
        main <- x$call
    title(main = main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels = labels, ...)
}

An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
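For readers who want to check their own answer, one possible solution (my sketch, not the author's) is a trapezoidal-rule AUC for the S3 objects above:

```r
# Area under the ROC curve by the trapezoidal rule. Assumes an object with
# $sens and $mspec components as constructed by ROC() above, where both
# vectors increase along the curve; the curve is closed at (0,0) and (1,1).
AUC <- function(x) {
    mspec <- c(0, x$mspec, 1)
    sens  <- c(0, x$sens, 1)
    sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}

# A perfectly separating test attains AUC = 1
DD <- table(-c(1, 2, 3, 10, 11, 12), c(0, 0, 0, 1, 1, 1))
perfect <- list(sens  = cumsum(DD[, 2]) / sum(DD[, 2]),
                mspec = cumsum(DD[, 1]) / sum(DD[, 1]))
stopifnot(abs(AUC(perfect) - 1) < 1e-12)
```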

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.

ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec, test = TT,
                 call = sys.call())
    class(rval) <- "ROC"
    rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.

ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec, test = TT,
                 call = sys.call())
    class(rval) <- "ROC"
    rval
}
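Putting the generic to work can be sketched as follows (the ROC.glm body is condensed from the code above; the simulated data, seed, and variable names are mine, for illustration only):

```r
ROC <- function(T, ...) UseMethod("ROC")

ROC.glm <- function(T, ...) {           # condensed from the text
    if (!(T$family$family %in% c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response")) * weights(T)
    DD <- table(-test, disease)
    rval <- list(sens  = cumsum(DD[, 2]) / sum(DD[, 2]),
                 mspec = cumsum(DD[, 1]) / sum(DD[, 1]),
                 test  = rev(sort(unique(test))),
                 call  = sys.call())
    class(rval) <- "ROC"
    rval
}

# Simulated logistic-regression example
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(2 * x))
r <- ROC(glm(y ~ x, family = binomial))  # dispatches to ROC.glm
stopifnot(inherits(r, "ROC"), all(diff(r$sens) >= 0))
```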

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test, while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
                   test = "numeric", call = "call"),
    validity = function(object) {
        length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)
    })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
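The validity check can be exercised directly (the class definition is repeated from above for self-containedness; the example objects are mine): new() refuses to construct an object whose slot lengths disagree.

```r
library(methods)   # explicit, for non-interactive use

setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
                   test = "numeric", call = "call"),
    validity = function(object) {
        length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)
    })

ok  <- new("ROC", sens = c(0.5, 1), mspec = c(0.5, 1),
           test = c(2, 1), call = quote(ROC(T, D)))
bad <- try(new("ROC", sens = 1, mspec = c(0.5, 1),
               test = 1, call = quote(ROC(T, D))),
           silent = TRUE)
stopifnot(is(ok, "ROC"), inherits(bad, "try-error"))
```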

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
    function(object) {
        cat("ROC curve: ")
        print(object@call)
    })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
    function(x, y, type = "b", null.line = TRUE,
             xlab = "1-Specificity",
             ylab = "Sensitivity",
             main = NULL, ...) {
        par(pty = "s")
        plot(x@mspec, x@sens, type = type,
             xlab = xlab, ylab = ylab, ...)
        if (null.line)
            abline(0, 1, lty = 3)
        if (is.null(main))
            main <- x@call
        title(main = main)
    })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
    function(x, ...) {
        lines(x@mspec, x@sens, ...)
    })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines: calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T, D) {
        if (!(T$family$family %in%
              c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")
        test <- fitted(T)
        disease <- (test + resid(T, type = "response"))
        disease <- disease * weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[, 2]) / sum(DD[, 2])
        mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
        new("ROC", sens = sens, mspec = mspec,
            test = TT, call = sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
   function(x, labels = NULL, ...,
            digits = 1) {
      if (is.null(labels))
         labels <- round(x@test, digits)
      identify(x@mspec, x@sens,
               labels = labels, ...)
   })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
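A small illustration of the new parsing behaviour (the variable name is arbitrary):

```r
my_var <- 10          # a syntactically valid name from 1.9.0 on
make.names("a_b")     # "a_b": the underscore is no longer altered
```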

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
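A small illustration of these semantics (the names are arbitrary):

```r
e <- new.env()
e$value <- 1:3       # $<- assigns directly in e
e[["flag"]] <- TRUE  # [[<- takes only character indices
e$val                # NULL: unlike lists, no partial matching
get("value", envir = e, inherits = FALSE)   # same as e$value
```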

• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.
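For example, extracting (but not drawing) the contours of the built-in volcano dataset:

```r
cl <- contourLines(volcano)   # a list of contour lines; no plot is produced
length(cl)                    # number of contour segments found
names(cl[[1]])                # "level" "x" "y"
```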

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
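Concretely:

```r
factorial(5)     # 120, computed as gamma(5 + 1)
lfactorial(5)    # log(120), computed as lgamma(5 + 1)
```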

• findInterval(x, v) now allows +/-Inf values, and NAs in x.
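To illustrate:

```r
findInterval(c(-Inf, 2.5, Inf, NA), c(1, 2, 3))
# 0 2 3 NA: infinite values land at the ends, NA propagates
```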

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
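For example:

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```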

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

  If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
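The new rep() behaviour can be seen in a few small cases:

```r
rep(numeric(0), length.out = 3)       # NA NA NA, no longer a 0-length result
rep(1:2, times = 3, length.out = 4)   # 1 2 1 2: times is ignored
rep(1:2, each = 3, length.out = 8)    # recycles instead of padding with NAs
```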

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
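For example:

```r
# fixed = TRUE treats the split pattern as a literal string, not a regexp
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
# without fixed = TRUE, "." is a regexp matching any character
```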

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
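A quick illustration:

```r
e <- new.env()
e$a <- 1
e$b <- "two"
as.list(e)   # a named list with components a and b (in no guaranteed order)
```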

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  – Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport that is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their "vp" slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

    "on" = clip to the viewport (was TRUE)
    "inherit" = clip to whatever parent says (was FALSE)
    "off" = turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NNs. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob overdispersed multinomial regres-sion using robust (LQD and tanh) estimationBy Walter R Mebane Jr Jasjeet Singh Sekhon

mvbutils Utilities for project organization editingand backup sourcing documentation (formaland informal) package preparation macrofunctions and miscellaneous utilities Neededby debug package By Mark V Bravington

mvpart Multivariate regression trees Original rpartby Terry M Therneau and Beth Atkinson Rport by Brian Ripley Some routines from ve-gan by Jari Oksanen Extensions and adapta-tions of rpart to mvpart by Glenn Dersquoath

pgam This work is aimed at extending a class ofstate space models for Poisson count dataso called Poisson-Gamma models towards asemiparametric specification Just like the gen-eralized additive models (GAM) cubic splinesare used for covariate smoothing The semi-parametric models are fitted by an iterativeprocess that combines maximization of likeli-hood and backfitting algorithm By Washing-ton Junger

qcc Shewhart quality control charts for continu-ous attribute and count data Cusum andEWMA charts Operating characteristic curvesProcess capability analysis Pareto chart andcause-and-effect chart By Luca Scrucca

race Implementation of some racing methods for theempirical selection of the best If the R pack-age rpvm is installed (and if PVM is availableproperly configured and initialized) the eval-uation of the candidates are performed in par-allel on different hosts By Mauro Birattari

rbugs Functions to prepare files needed for run-ning BUGS in batch-mode and running BUGSfrom R Support for Linux systems with Wineis emphasized By Jun Yan (with part ofthe code modified from lsquobugsRrsquo httpwwwstatcolumbiaedu~gelmanbugsR by An-drew Gelman)

ref small package with functions for creating ref-erences reading from and writing ro refer-ences and a memory efficient refdata typethat transparently encapsulates matrixes anddataframes By Jens Oehlschlaumlgel

regress Functions to fit Gaussian linear model bymaximizing the residual log likelihood Thecovariance structure can be written as a lin-ear combination of known matrices Can beused for multivariate models and random ef-fects models Easy straight forward manner tospecify random effects models including ran-dom interactions By David Clifford Peter Mc-Cullagh

rgl 3D visualization device (OpenGL) By DanielAdler


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

bull Emanuele De Rinaldis (Italy)

bull Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/



    if(null.line)
        abline(0, 1, lty=3)
    if(is.null(main))
        main <- x$call
    title(main=main)
}

A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...)
    lines(x$mspec, x$sens, ...)

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ..., digits=1)
{
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels=labels, ...)
}

An AUC function to compute the area under the ROC curve is left as an exercise for the reader.
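A minimal sketch of that exercise (an editorial addition, not part of the original article): it applies the trapezoidal rule to the $sens and $mspec components of an "ROC" object, after padding with the (0, 0) and (1, 1) corner points. The function name AUC and the corner-padding convention are assumptions of this sketch.

```r
## Trapezoidal-rule area under an S3 "ROC" object (illustrative sketch).
## Assumes x$sens and x$mspec are nondecreasing, as ROC.default produces.
AUC <- function(x) {
    mspec <- c(0, x$mspec, 1)   # pad with the (0,0) and (1,1) corners
    sens  <- c(0, x$sens, 1)
    ## sum of trapezoid areas: width times average height
    sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```

For a test with no discrimination (the diagonal) this returns 0.5, and for a perfect test it returns 1.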

Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function:

ROC <- function(T, ...)
    UseMethod("ROC")

ROC.default <- function(T, D, ...)
{
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens=sens, mspec=mspec,
                 test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}

The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:

ROC.glm <- function(T, ...)
{
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens=sens, mspec=mspec,
                 test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}

S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:

setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
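As an illustrative sketch of those extra checks (not from the original article), a stricter validity function could be written separately; the name validROC and the exact checks are assumptions here:

```r
## Hypothetical stricter validity check for the "ROC" class above:
## equal lengths, rates within [0, 1], and monotone ordering.
validROC <- function(object) {
    if (length(object@sens) != length(object@mspec) ||
        length(object@sens) != length(object@test))
        return("sens, mspec and test must have equal lengths")
    if (any(object@sens < 0 | object@sens > 1) ||
        any(object@mspec < 0 | object@mspec > 1))
        return("sens and mspec must lie in [0, 1]")
    if (is.unsorted(object@sens) || is.unsorted(object@mspec))
        return("sens and mspec must be nondecreasing")
    TRUE
}
```

It would be registered by passing validity=validROC to the setClass call.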

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new() function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D)
{
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity", ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
        abline(0, 1, lty=3)
    if(is.null(main))
        main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
  )

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object:

setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
        labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R
by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously, padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))

R News ISSN 1609-3631

Vol. 4/1, June 2004

– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

  "on": clip to the viewport (was TRUE)
  "inherit": clip to whatever the parent says (was FALSE)
  "off": turn off clipping

Logical values are still accepted (and NA maps to "off").
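The three settings can be sketched in a few lines (sizes here chosen arbitrarily for illustration):

```r
library(grid)

grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5, clip = "on"))
grid.circle(r = 0.7)     # clipped at the viewport boundary
pushViewport(viewport(clip = "off"))
grid.circle(r = 0.9)     # drawn in full, ignoring the boundary
popViewport(2)
```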

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).
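The replacement calls are direct (the vector x below is an arbitrary example):

```r
## psigamma() replaces the deprecated polygamma convenience functions
x <- c(0.5, 1, 5)
psigamma(x, deriv = 2)   # replaces tetragamma(x)
psigamma(x, deriv = 3)   # replaces pentagamma(x)
```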

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 groups) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm, and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST, and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NNs. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-PLUS times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Danmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}

Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:

setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })

The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.

setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity", ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (null.line)
      abline(0, 1, lty=3)
    if (is.null(main))
      main <- x@call
    title(main=main)
  })

Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })

In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call:

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.

setGeneric("ROC")
setMethod("ROC", c("glm", "missing"),
    function(T, ...) {
        if (!(T$family$family %in%
              c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")

R News ISSN 1609-3631

Vol. 4/1, June 2004

        test <- fitted(T)
        disease <- (test + resid(T, type = "response"))
        disease <- disease * weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[, 2])/sum(DD[, 2])
        mspec <- cumsum(DD[, 1])/sum(DD[, 1])
        new("ROC", sens = sens, mspec = mspec,
            test = TT, call = sys.call())
    })

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))
setMethod("identify", "ROC",
    function(x, ..., labels = NULL,
             digits = 1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
                 labels = labels, ...)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R

by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook: see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
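As a quick illustration of these environment semantics (example code not from the newsletter):

```r
e <- new.env()
e$value <- 1                # like assign("value", 1, envir = e)
e[["other"]] <- 2           # only character indices are allowed
e$value + e[["other"]]      # 3
is.null(e$val)              # TRUE: no partial matching of "value"
```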

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
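A quick sketch of the new class in use (example code not from the newsletter):

```r
d <- as.Date("2004-06-01")
d + 30                            # arithmetic is in whole days: 2004-07-01
format(d, "%d %b %Y")             # formatting works as for date-times
seq(d, by = "month", length = 3)  # sequences of dates
```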

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
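For instance (a small illustration, not from the newsletter; note the gamma definition means non-integer arguments are accepted too):

```r
factorial(5)       # 120
lfactorial(500)    # log(500!) stays finite where factorial(500) overflows
factorial(4.5)     # gamma(5.5), since factorial(x) is gamma(x+1)
```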

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
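A one-line illustration (example code not from the newsletter):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)   # a named list: list(a = 1, b = 2)
```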

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
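For example (a small sketch, not from the newsletter), quoting a filename containing a single quote before pasting it into a shell command line:

```r
f <- "it's here.txt"
q <- shQuote(f)              # safe to interpolate into a sh command
cmd <- paste("ls -l", q)     # e.g. for use with system(cmd)
```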

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.
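The fixed argument is handy when the separator is a regexp metacharacter (example code not from the newsletter):

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c": literal "." split
strsplit("a1b22c", "[0-9]+")[[1]]          # "a" "b" "c": regexp split
strsplit("abc", "")[[1]]                   # "a" "b" "c": per-character split
```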

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A",
          gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"),
          gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas='goto' to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object-oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods in-cluding support vector machines By Alexan-dros Karatzoglou Alex Smola Achim ZeileisKurt Hornik

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley; some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
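As a minimal sketch of typical usage (not from the newsletter; assumes the qcc package is installed), an X-bar chart for simulated in-control data:

```r
library(qcc)

set.seed(1)
## 20 rational subgroups of size 5 from an in-control process
x <- matrix(rnorm(100, mean = 10, sd = 1), ncol = 5)

## Shewhart x-bar chart with the usual 3-sigma control limits
q <- qcc(x, type = "xbar")
summary(q)
```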

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler

rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
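A brief hedged sketch of typical use (not from the newsletter; assumes the sandwich package is installed):

```r
library(sandwich)

## An ordinary least-squares fit on a built-in data set
fm <- lm(dist ~ speed, data = cars)

vcovHC(fm)              # heteroskedasticity-consistent covariance matrix
sqrt(diag(vcovHC(fm)))  # the corresponding model-robust standard errors
```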

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
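A minimal sketch of what such an irregular series looks like (not from the newsletter; assumes the zoo package is installed):

```r
library(zoo)

## An irregular daily series indexed by Date
z <- zoo(c(1.2, 0.8, 1.5, 1.1, 0.9),
         as.Date("2004-06-01") + c(0, 1, 3, 7, 11))
z

## Align the series with its one-step lag; missing dates give NAs
merge(z, lag(z, k = -1))
```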

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

test <- fitted(T)
disease <- (test + resid(T, type = "response"))
disease <- disease * weights(T)
if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
TT <- rev(sort(unique(test)))
DD <- table(-test, disease)
sens <- cumsum(DD[, 2])/sum(DD[, 2])
mspec <- cumsum(DD[, 1])/sum(DD[, 1])
new("ROC", sens = sens, mspec = mspec,
    test = TT, call = sys.call())
})

Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a missing second argument in the method.

setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
    function(x, labels = NULL, digits = 1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens, labels = labels)
    })

Discussion

Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model is a linear model, more that it has a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle

Changes in R
by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
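Two of the items above can be sketched in a couple of lines (a minimal illustration, not taken from the newsletter):

```r
## as.Date() method for "POSIXlt" objects, and mean() for "difftime"
d <- as.Date(as.POSIXlt("2004-06-01 12:00:00", tz = "GMT"))
dt <- as.Date("2004-06-10") - as.Date(c("2004-06-01", "2004-06-05"))
d          # the Date 2004-06-01
mean(dt)   # average of the two differences: 7 days
```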

User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list, and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

"on" – clip to the viewport (was TRUE)
"inherit" – clip to whatever parent says (was FALSE)
"off" – turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
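A few of the 1.9.0 additions above, sketched together in one short session (a minimal illustration, not taken from the newsletter):

```r
e <- new.env()
e$x <- 1:3                   # $ and $<- now work on environments
vals <- mget("x", envir = e) # mget() retrieves multiple values at once
factorial(5)                 # new factorial(), defined as gamma(x+1)
head(letters, 3)             # new head()/tail() in package utils
shQuote("it's")              # quote a string for passing to an OS shell
```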

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model, and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

R News ISSN 1609-3631

Vol. 4/1, June 2004

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences. Title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics – Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST requests. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal/external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina – Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/


been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
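A minimal sketch of these semantics (the $ and [[ forms are just convenient spellings of get/assign with inherits=FALSE):

```r
e <- new.env()
e$x <- 42                      # same effect as assign("x", 42, envir = e)
stopifnot(e$x == 42, e[["x"]] == 42)

e[["x"]] <- 1                  # replacement form also works
stopifnot(identical(get("x", envir = e, inherits = FALSE), 1))
```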

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
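A quick sketch of the "Date" class: dates parse from ISO strings, support day arithmetic, and format like date-times:

```r
d <- as.Date("2004-06-01")
stopifnot(format(d + 30) == "2004-07-01")           # day arithmetic

# differences between dates come back in days
n <- as.numeric(as.Date("2004-06-15") - d)
stopifnot(n == 14)
```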

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument, package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.
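A minimal sketch, checking the p = 3 Minkowski distance against the definition (sum |x_i - y_i|^p)^(1/p):

```r
x <- rbind(c(0, 0), c(1, 2))
d3 <- dist(x, method = "minkowski", p = 3)
stopifnot(all.equal(as.numeric(d3), (1^3 + 2^3)^(1/3)))
```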

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
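Since these are defined via the gamma function they also accept non-integer arguments; for integers they behave as expected:

```r
stopifnot(factorial(5) == 120,
          all.equal(lfactorial(5), log(120)),
          all.equal(factorial(5), gamma(6)))   # the underlying definition
```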

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.


• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
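A minimal sketch: mget() looks up a character vector of names in one call and returns a named list:

```r
e <- new.env()
e$a <- 1
e$b <- 2
vals <- mget(c("a", "b"), envir = e)
stopifnot(identical(vals, list(a = 1, b = 2)))
```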

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... on to the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
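The first two of these compatibility changes can be sketched directly:

```r
# 0-length input padded with NAs up to length.out
stopifnot(identical(rep(numeric(0), length.out = 3),
                    as.numeric(c(NA, NA, NA))))

# each + length.out: recycles rather than padding with NAs
stopifnot(identical(rep(1:2, each = 2, length.out = 5),
                    c(1L, 1L, 2L, 2L, 1L)))
```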

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
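A minimal sketch: the default (Bourne-shell) quoting wraps the string in single quotes, so file names with spaces survive a trip through system():

```r
q <- shQuote("my file.txt")
stopifnot(identical(q, "'my file.txt'"))
cmd <- paste("ls", q)   # a shell command that is safe to pass to system()
```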

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
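A minimal sketch of the two additions: fixed=TRUE treats the split string literally rather than as a regular expression, and split="" splits into single characters:

```r
# "." is a regexp metacharacter; fixed = TRUE takes it literally
stopifnot(identical(strsplit("a.b.c", ".", fixed = TRUE)[[1]],
                    c("a", "b", "c")))

# the optimized empty-split case
stopifnot(identical(strsplit("abc", "")[[1]], c("a", "b", "c")))
```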

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.


• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
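A minimal sketch (note that the order of names in an environment is unspecified, so the check below uses setequal rather than identical on names):

```r
e <- new.env()
e$a <- 1
e$b <- "two"
l <- as.list(e)   # dispatches to as.list.environment
stopifnot(setequal(names(l), c("a", "b")),
          identical(l[["b"]], "two"))
```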

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
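A minimal sketch: with the default summary function (sum), addmargins() appends "Sum" rows and columns to a table:

```r
tab <- as.table(matrix(1:4, 2,
                       dimnames = list(g = c("a", "b"), h = c("x", "y"))))
m <- addmargins(tab)   # default: sums over every margin
stopifnot(m["Sum", "Sum"] == sum(tab),
          all(m["Sum", c("x", "y")] == colSums(tab)))
```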

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

bull Changes to package rsquogridrsquo

ndash Renamed pushpopviewport() topushpopViewport()

ndash Added upViewport() downViewport()and seekViewport() to allow creationand navigation of viewport tree (ratherthan just viewport stack)

ndash Added id and idlengths arguments togridpolygon() to allow multiple poly-gons within single gridpolygon() call

ndash Added vpList() vpStack() vpTree()and currentvpTree() to allow creationof viewport bundles that may be pushedat once (lists are pushed in parallel stacksin series)

currentvpTree() returns the currentviewport tree

ndash Added vpPath() to allow specificationof viewport path in downViewport() andseekViewport()

See viewports for an example of its use

NOTE it is also possible to specify a pathdirectly eg something like vp1vp2but this is only advised for interactive use(in case I decide to change the separator in later versions)

  – Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

        pushViewport(viewport(w=.5, h=.5, name="A"))
        grid.rect()
        pushViewport(viewport(w=.5, h=.5, name="B"))
        grid.rect(gp=gpar(col="grey"))
        upViewport(2)
        grid.rect(vp="A", gp=gpar(fill="red"))
        grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))

R News ISSN 1609-3631

Vol. 4/1, June 2004, page 40

  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have "children". A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

        "on"       clip to the viewport (was TRUE)
        "inherit"  clip to whatever parent says (was FALSE)
        "off"      turn off clipping

    Still accept logical values (and NA maps to "off").
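As a minimal sketch tying together two of the grid changes above (path-based viewport navigation, and the *Grob() counterparts of the grid.*() drawing functions); the viewport names "A" and "B" are illustrative:

```r
library(grid)

grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5, name = "A"))
pushViewport(viewport(width = 0.5, height = 0.5, name = "B"))
upViewport(2)                    # navigate back up to the root...
downViewport(vpPath("A", "B"))   # ...and down again by explicit path

# rectGrob() is the *Grob() counterpart of grid.rect(): it returns a
# grob without drawing anything; drawing is a separate, explicit step.
r <- rectGrob(gp = gpar(fill = "grey"))
grid.draw(r)
current.viewport()$name          # "B"
```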

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates are performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: [email protected]

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/



Vol. 4/1, June 2004, page 38

• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
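So, for example, a numeric vector is now silently coerced:

```r
# Numeric arguments are coerced to character instead of raising an error.
grep("2", c(10, 2, 32))   # 2 3 -- elements "2" and "32" contain "2"
```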

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
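These return the first or last part of a vector, matrix or data frame; a minimal sketch:

```r
head(letters, 3)   # "a" "b" "c"
tail(letters, 2)   # "y" "z"
head(mtcars)       # first six rows of the data frame
```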

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
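For example (the variable names are illustrative):

```r
x <- 1
y <- 2
mget(c("x", "y"))   # list(x = 1, y = 2), fetched in a single call
```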

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• POSIXct objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.
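A quick check of the equivalences:

```r
# psigamma(x, deriv = n) generalizes digamma() (n = 0) and
# trigamma() (n = 1); all now accept negative (non-integer) x as well.
all.equal(psigamma(2.5, deriv = 0), digamma(2.5))    # TRUE
all.equal(psigamma(2.5, deriv = 1), trigamma(2.5))   # TRUE
digamma(-1.5)                                        # finite for x < 0
```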

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
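The interaction of these arguments can be seen in a few invented calls (a sketch of the S-compatible behaviour described above):

```r
# a zero-length input is padded out to the requested length
rep(integer(0), length.out = 3)      # NA NA NA (length 3, not 0)

# each combined with length.out recycles rather than padding with NA
rep(1:2, each = 2, length.out = 3)   # 1 1 2

# when times and length.out are both given, times is ignored
rep(1:3, times = 2, length.out = 4)  # 1 2 3 1
```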

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
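A small sketch of typical use (the file name is invented; type = "sh" requests Bourne-shell quoting explicitly, since the default depends on the platform):

```r
# quote a string so the shell treats it as a single word
shQuote("my file.txt", type = "sh")  # 'my file.txt'

# embedded single quotes are escaped safely
shQuote("it's here", type = "sh")

# typical use: building a command line for system()
cmd <- paste("ls -l", shQuote("my file.txt", type = "sh"))
```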

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.
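The difference the fixed argument makes can be sketched as follows:

```r
# fixed = TRUE treats the split string literally
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"

# without it, "." is a regular expression matching any character,
# so every piece between matches is empty
strsplit("a.b.c", ".")[[1]]
```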

• subset() now allows a drop argument which is passed on to the indexing method for data frames.
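For example (a minimal sketch with invented data):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# drop = TRUE is passed on to [.data.frame, so selecting a single
# column returns a plain vector instead of a one-column data frame
subset(df, x > 1, select = x, drop = TRUE)  # 2 3
```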

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.

R News ISSN 1609-3631

Vol. 4/1, June 2004

• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
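A minimal sketch (the bindings are invented):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", "two", envir = e)

# as.list() on an environment returns a named list of its bindings
l <- as.list(e)
sort(names(l))  # "a" "b"
```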

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
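A short sketch with an invented two-way table:

```r
tab <- table(grp = c("a", "a", "b"), val = c("x", "y", "y"))

# addmargins() appends a "Sum" row and column by default
m <- addmargins(tab)
m["Sum", "Sum"]  # 3, the grand total
```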

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))


– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates are performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement. (Simulated Urns.) By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. By S. Original by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

Vol. 4/1, June 2004

• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
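As a quick illustration of how this attribute can be consulted (the data frame and variable names below are invented for the example):

```r
# The "dataClasses" attribute stored on a fitted model's terms records
# the class of each variable used at fit time, so new data supplied for
# prediction can be checked against it.
d <- data.frame(y = rnorm(10), x = rnorm(10), g = factor(rep(1:2, 5)))
fit <- lm(y ~ x + g, data = d)
dc <- attr(terms(fit), "dataClasses")
dc   # named character vector: "numeric" for y and x, "factor" for g
```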

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the '...' of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
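A minimal sketch of the coercion this method enables (object names are made up):

```r
# Coerce an environment's bindings into an ordinary named list.
e <- new.env()
e$a <- 1
e$b <- "two"
lst <- as.list(e)   # dispatches to as.list.environment
```

Note that the order of elements in the resulting list is not guaranteed, so entries should be accessed by name.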

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
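A short sketch of the default use, adding "Sum" margins to a contingency table built from a base data set:

```r
# addmargins() extends a table with marginal summaries; the default
# margin function is sum, giving row and column totals.
tab <- table(gear = mtcars$gear, cyl = mtcars$cyl)
with_totals <- addmargins(tab)
with_totals   # has an extra "Sum" row and "Sum" column
```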

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See setMethod.
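A sketch of this behaviour (the generic and method names are invented for the example): the generic declares the argument with no default, and the selected method supplies one.

```r
library(methods)

# Generic with no default for 'factor'.
setGeneric("scaleBy", function(x, factor) standardGeneric("scaleBy"))

# The method's default expression is used when 'factor' is missing
# from the call.
setMethod("scaleBy", "numeric", function(x, factor = 10) x * factor)

scaleBy(3)      # method default used: 30
scaleBy(3, 2)   # explicit argument: 6
```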

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g. something like "vp1::vp2", but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w = 0.5, h = 0.5, name = "A"))
    grid.rect()
    pushViewport(viewport(w = 0.5, h = 0.5, name = "B"))
    grid.rect(gp = gpar(col = "grey"))
    upViewport(2)
    grid.rect(vp = "A", gp = gpar(fill = "red"))
    grid.rect(vp = vpPath("A", "B"), gp = gpar(fill = "blue"))

– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

  * grobs all now have a name slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

  * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

  * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

  * grobs now save()/load() like any normal R object.

  * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

  * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot, which is a viewport that is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

  * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

  * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit() and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

    "on"      clip to the viewport (was TRUE)
    "inherit" clip to whatever the parent says (was FALSE)
    "off"     turn off clipping

Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
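The point of fixing the generator and the seed is that example output becomes identical from run to run, which is what comparison against saved check output relies on. A minimal sketch:

```r
# With a fixed seed, "random" example output is fully reproducible.
set.seed(1)
first <- rnorm(3)
set.seed(1)
second <- rnorm(3)
identical(first, second)   # TRUE
```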

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
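A hypothetical DESCRIPTION fragment (the package names are invented) showing where the new field sits alongside Depends:

```
Package: mypkg
Version: 0.1-0
Depends: R (>= 1.9.0), stats
Suggests: lattice
```

Packages listed under Suggests need not be installed for mypkg to be used; they are only required to run its examples or build its vignettes.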

• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().
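A sketch of the replacement function, which returns the parsed DESCRIPTION metadata of an installed package:

```r
# packageDescription() parses a package's DESCRIPTION file into a
# list-like object whose fields can be accessed by name.
desc <- packageDescription("stats")
desc$Package    # "stats"
desc$Priority   # "base"
```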

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; eigen() now uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv = 2) and psigamma(x, deriv = 3).
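The equivalence can be checked numerically; a sketch (the evaluation point is arbitrary, and the closed form for psi''(1) is the standard identity psi''(1) = -2*zeta(3)):

```r
# psigamma(x, deriv = 2) is the second derivative of digamma(x),
# i.e. what the deprecated tetragamma(x) computed. Verify against a
# central finite difference of the first derivative.
x <- 2.5
d2 <- psigamma(x, deriv = 2)
h <- 1e-4
approx_d2 <- (psigamma(x + h, deriv = 1) - psigamma(x - h, deriv = 1)) / (2 * h)
```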

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard unit testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control and full keyboard input. By Mark V. Bravington.

distr Object-oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP request protocols. Implements the GET, POST and multipart POST requests. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.

kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NNs. By Sam Buttrey.

kza Kolmogorov-Zurbenko adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and the backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.

rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869 and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org

R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia), Javier de las Rivas (Spain), Sandrine Dudoit (USA), Werner Engl (Austria), Balazs Györffy (Germany), Branimir K. Hackenberger (Croatia), O. Maurice Haynes (USA), Kai Hendry (UK), Arne Henningsen (Germany), Wolfgang Huber (Germany), Marjatta Kari (Finland), Osmo Kari (Finland), Peter Lauf (Germany), Andy Liaw (USA), Mu-Yen Lin (Taiwan), Jeffrey Lins (Denmark), James W. MacDonald (USA), Marlene Müller (Germany), Gillian Raab (UK), Roland Rau (Germany), Michael G. Schimek (Austria), R. Woodrow Setzer (USA), Thomas Telkamp (Netherlands), Klaus Thul (Taiwan), Shusaku Tsumoto (Japan), Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

                                        • Other changes
                                          • R Foundation News
                                            • Donations
                                            • New supporting institutions
                                            • New supporting members
Page 40: R News 4(1)

Vol. 4/1, June 2004, page 40

– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references; they are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob (actually, they take a gPath; see below).

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot, which is a viewport that is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
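The copy-by-value behaviour described above can be seen without drawing anything; a minimal sketch using grid (which ships with R), where the grob name "mycircle" is an illustrative choice:

```r
library(grid)

# Grobs are now ordinary R objects with copy-by-value semantics:
# they can be created, edited and queried without graphical output.
g  <- circleGrob(r = unit(0.3, "npc"), name = "mycircle")
g2 <- editGrob(g, r = unit(0.4, "npc"))  # returns a modified copy

g$name                 # "mycircle" -- the name used on the display list
identical(g$r, g2$r)   # FALSE: editing the copy left the original alone
```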

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

    "on"       clip to the viewport (was TRUE)
    "inherit"  clip to whatever parent says (was FALSE)
    "off"      turn off clipping

  Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
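The point of seeding the generator is that "random" example output becomes reproducible from run to run; a base-R illustration:

```r
# With a fixed seed, random draws are identical across runs, which is
# what seeding the RNG before executing Rd examples relies on.
set.seed(1)
first  <- rnorm(3)
set.seed(1)
second <- rnorm(3)
identical(first, second)  # TRUE
```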

• codoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.

R News ISSN 1609-3631


• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().
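A quick sketch of the replacement function, which parses the DESCRIPTION file of an installed package (stats is used here only for illustration):

```r
# packageDescription() supersedes package.description(): it returns
# the fields of an installed package's DESCRIPTION as a named list.
desc <- packageDescription("stats")
desc$Package        # "stats"
names(desc)[1:3]    # the first few DESCRIPTION field names
```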

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv = 2) and psigamma(*, deriv = 3).
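The replacements are straightforward renamings; a sketch (the value 2.5 is arbitrary):

```r
# tetragamma(x) and pentagamma(x) are the 2nd and 3rd derivatives of
# digamma(); psigamma() computes any derivative directly.
x  <- 2.5
d2 <- psigamma(x, deriv = 2)  # was tetragamma(x)
d3 <- psigamma(x, deriv = 3)  # was pentagamma(x)
c(d2, d3)
```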

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.

Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
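The p-value vector such a package consumes can be built with the ordinary tests named in the entry; a hedged base-R sketch (HighProbability's own interface is not shown here, and the simulated data are purely illustrative):

```r
# Build a vector of p-values with a standard frequentist test:
# 100 one-sample t-tests on simulated null data.
set.seed(42)
pvals <- replicate(100, t.test(rnorm(20))$p.value)
summary(pvals)  # roughly uniform on (0, 1) under the null
```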

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) by S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple algorithm for designing group sequential clinical trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ('goodies') from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the Matérn and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

  • Editorial
  • The Decision To Use R
    • Preface
    • Some Background on the Company
    • Evaluating The Tools of the Trade - A Value Based Equation
    • An Introduction To R
    • A Continued Evolution
    • Giving Back
    • Thanks
      • The ade4 package - I One-table methods
        • Introduction
        • The duality diagram class
        • Distance matrices
        • Taking into account groups of individuals
        • Permutation tests
        • Conclusion
          • qcc An R package for quality control charting and statistical process control
            • Introduction
            • Creating a qcc object
            • Shewhart control charts
            • Operating characteristic function
            • Cusum charts
            • EWMA charts
            • Process capability analysis
            • Pareto charts and cause-and-effect diagrams
            • Tuning and extending the package
            • Summary
            • References
              • Least Squares Calculations in R
              • Tools for interactively exploring R packages
                • References
                  • The survival package
                    • Overview
                    • Specifying survival data
                    • One and two sample summaries
                    • Proportional hazards models
                    • Additional features
                      • useR 2004
                      • R Help Desk
                        • Introduction
                        • Other Applications
                        • Time Series
                        • Input
                        • Avoiding Errors
                        • Comparison Table
                          • Programmers Niche
                            • ROC curves
                            • Classes and methods
                              • Generalising the class
                                • S4 classes and method
                                • Discussion
                                  • Changes in R
                                    • New features in 191
                                    • User-visible changes in 190
                                    • New features in 190
                                      • Changes on CRAN
                                        • New contributed packages
                                        • Other changes
                                          • R Foundation News
                                            • Donations
                                            • New supporting institutions
                                            • New supporting members
Page 41: R News 4(1)

Vol 41 June 2004 41

• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option repositories to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
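As a small illustration of two of the renamings above (this sketch is not from the newsletter; it assumes R >= 1.9.0, where psigamma() and printCoefmat() are available):

```r
## Replacements for the deprecated tetragamma()/pentagamma():
x <- c(0.5, 1, 2)
psigamma(x, deriv = 2)   # what tetragamma(x) used to compute
psigamma(x, deriv = 3)   # what pentagamma(x) used to compute

## print.coefmat() is defunct; printCoefmat() prints a coefficient table:
fit <- lm(dist ~ speed, data = cars)
printCoefmat(coef(summary(fit)))
```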

Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book “Nondetects And Data Analysis: Statistics for Censored Environmental Data”. By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from “Topics in circular Statistics” (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object-oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching “Financial Engineering and Computational Finance”. By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST requests. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.


kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book “S Poetry”, available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
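A minimal sketch of what a qcc session might look like (not from the newsletter; the data are invented, and it assumes the package's qcc() constructor with type = "xbar"):

```r
library(qcc)

## 20 hypothetical samples of size 5, one sample per matrix row:
x <- matrix(rnorm(100, mean = 10, sd = 1), ncol = 5)

q <- qcc(x, type = "xbar", plot = FALSE)  # Shewhart xbar chart object
summary(q)                                # center, control limits, violations
```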

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from ‘bugs.R’, http://www.stat.columbia.edu/~gelman/bugsR, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward way to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), “A simple Algorithm for Designing Group Sequential Clinical Trials”, Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities [‘goodies’] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
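To make the idea of a totally ordered index concrete, a small sketch (illustrative values only; zoo(), index() and coredata() are the package's basic constructor and accessors):

```r
library(zoo)

## Three observations on irregular dates:
z <- zoo(c(1.2, 0.8, 1.5),
         as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))

coredata(z)  # the observations
index(z)     # the ordered, irregular time index
```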

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News

by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
[email protected]

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

R News ISSN 1609-3631

  • Editorial
  • The Decision To Use R
    • Preface
    • Some Background on the Company
    • Evaluating The Tools of the Trade - A Value Based Equation
    • An Introduction To R
    • A Continued Evolution
    • Giving Back
    • Thanks
      • The ade4 package - I One-table methods
        • Introduction
        • The duality diagram class
        • Distance matrices
        • Taking into account groups of individuals
        • Permutation tests
        • Conclusion
          • qcc An R package for quality control charting and statistical process control
            • Introduction
            • Creating a qcc object
            • Shewhart control charts
            • Operating characteristic function
            • Cusum charts
            • EWMA charts
            • Process capability analysis
            • Pareto charts and cause-and-effect diagrams
            • Tuning and extending the package
            • Summary
            • References
              • Least Squares Calculations in R
              • Tools for interactively exploring R packages
                • References
                  • The survival package
                    • Overview
                    • Specifying survival data
                    • One and two sample summaries
                    • Proportional hazards models
                    • Additional features
                      • useR 2004
                      • R Help Desk
                        • Introduction
                        • Other Applications
                        • Time Series
                        • Input
                        • Avoiding Errors
                        • Comparison Table
                          • Programmers Niche
                            • ROC curves
                            • Classes and methods
                              • Generalising the class
                                • S4 classes and method
                                • Discussion
                                  • Changes in R
                                    • New features in 191
                                    • User-visible changes in 190
                                    • New features in 190
                                      • Changes on CRAN
                                        • New contributed packages
                                        • Other changes
                                          • R Foundation News
                                            • Donations
                                            • New supporting institutions
                                            • New supporting members
Page 42: R News 4(1)

Vol 41 June 2004 42

follow-up designs By Ernesto Barrios based onDaniel Meyerrsquos code

DCluster A set of functions for the detection of spa-tial clusters of disease using count data Boot-strap is used to estimate sampling distributionsof statistics By Virgilio Goacutemez-Rubio JuanFerraacutendiz Antonio Loacutepez

GeneTS A package for analyzing multiple gene ex-pression time series data Currently GeneTSimplements methods for cell cycle analy-sis and for inferring large sparse graphi-cal Gaussian models For plotting the in-ferred genetic networks GeneTS requires thegraph and Rgraphviz packages (availablefrom wwwbioconductororg) By Konstanti-nos Fokianos Juliane Schaefer and KorbinianStrimmer

HighProbability Provides a simple fast reliable so-lution to the multiple testing problem Given avector of p-values or achieved significance lev-els computed using standard frequentist infer-ence HighProbability determines which onesare low enough that their alternative hypothe-ses can be considered highly probable The p-value vector may be determined using exist-ing R functions such as ttest wilcoxtestcortest or sample HighProbability can beused to detect differential gene expression andto solve other problems involving a large num-ber of hypothesis tests By David R Bickel

Icens Many functions for computing the NPMLE forcensored and truncated data By R Gentlemanand Alain Vandal

MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 04 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn

MNP A publicly available R package that fitsthe Bayesian Multinomial Probit models viaMarkov chain Monte Carlo Along with thestandard Multinomial Probit model it can alsofit models with different choice sets for each ob-servation and complete or partial ordering ofall the available alternatives The estimation isbased on the efficient marginal data augmen-tation algorithm that is developed by Imai andvan Dyk (2004) By Kosuke Imai Jordan VanceDavid A van Dyk

NADA Contains methods described by Dennis RHelsel in his book ldquoNondetects And Data Anal-ysis Statistics for Censored EnvironmentalDatardquo By Lopaka Lee

R2WinBUGS Using this package it is possible tocall a BUGS model summarize inferences andconvergence in a table and graph and savethe simulations in arrays for easy access inR By originally written by Andrew Gelmanchanges and packaged by Sibylle Sturtz andUwe Ligges

RScaLAPACK An interface to ScaLAPACK func-tions from R By Nagiza F Samatova SrikanthYoginath and David Bauer

RUnit R functions implementing a standard UnitTesting framework with additional code in-spection and report generation tools ByMatthias Burger Klaus Juenemann ThomasKoenig

SIN This package provides routines to perform SINmodel selection as described in Drton amp Perl-man (2004) The selected models are in theformat of the ggm package which allows inparticular parameter estimation in the selectedmodel By Mathias Drton

SoPhy SWMS_2D interface Submission to Comput-ers and Geosciences title The use of the lan-guage interface of R two examples for mod-elling water flux and solute transport By Mar-tin Schlather Bernd Huwe

SparseLogReg Some functions wrapping the sparselogistic regression code by S K Shevade and SS Keerthi originally intended for microarray-based gene selection By Michael T Maderwith C code from S K Shevade and S SKeerthi

assist ASSIST see manual By Yuedong Wang andChunlei Ke

bayesmix Bayesian mixture models of univariateGaussian distributions using JAGS By BettinaGruen

betareg Beta regression for modeling rates and pro-portions By Alexandre de Bustamante Simas

circular Circular Statistics from ldquoTopics in circularStatisticsrdquo (2001) S Rao Jammalamadaka andA SenGupta World Scientific By Ulric LundClaudio Agostinelli

crossdes Contains functions for the constructionand randomization of balanced carryover bal-anced designs Contains functions to check

R News ISSN 1609-3631

Vol 41 June 2004 43

given designs for balance Also contains func-tions for simulation studies on the validityof two randomization procedures By MartinOliver Sailer

debug Debugger for R functions with code displaygraceful error recovery line-numbered condi-tional breakpoints access to exit code flowcontrol and full keyboard input By Mark VBravington

distr Object orientated implementation of distribu-tions and some additional functionality ByFlorian Camphausen Matthias Kohl PeterRuckdeschel Thomas Stabla

dynamicGraph Interactive graphical tool for ma-nipulating graphs By Jens Henrik Badsberg

energy E-statistics (energy) tests for comparing dis-tributions multivariate normality Poisson testmultivariate k-sample test for equal distribu-tions hierarchical clustering by e-distancesEnergy-statistics concept based on a general-ization of Newtonrsquos potential energy is due toGabor J Szekely By Maria L Rizzo and GaborJ Szekely

evdbayes Provides functions for the bayesian anal-ysis of extreme value models using MCMCmethods By Alec Stephenson

evir Functions for extreme value theory which maybe divided into the following groups ex-ploratory data analysis block maxima peaksover thresholds (univariate and bivariate)point processes gevgpd distributions By Soriginal (EVIS) by Alexander McNeil R port byAlec Stephenson

fBasics fBasics package from Rmetrics mdash Rmetricsis an Environment and Software Collection forteaching ldquoFinancial Engineering and Compu-tational Financerdquo By Diethelm Wuertz andmany others see the SOURCE file

fExtremes fExtremes package from Rmetrics ByDiethelm Wuertz and many others see theSOURCE file

fOptions fOptions package from Rmetrics ByDiethelm Wuertz and many others see theSOURCE file

fSeries fOptions package from Rmetrics By Di-ethelm Wuertz and many others see theSOURCE file

fortunes R Fortunes By Achim Zeileis fortune con-tributions from Torsten Hothorn Peter Dal-gaard Uwe Ligges Kevin Wright

gap This is an integrated package for genetic dataanalysis of both population and family dataCurrently it contains functions for samplesize calculations of both population-based andfamily-based designs probability of familialdisease aggregation kinship calculation somestatistics in linkage analysis and associationanalysis involving one or more genetic mark-ers including haplotype analysis In the futureit will incorporate programs for path and seg-regation analyses as well as other statistics inlinkage and association analyses By Jing huaZhao in collaboration with other colleaguesand with help from Kurt Hornik and Brian Rip-ley of the R core development team

gclus Orders panels in scatterplot matrices and par-allel coordinate displays by some merit indexPackage contains various indices of merit or-dering functions and enhanced versions ofpairs and parcoord which color panels accord-ing to their merit level By Catherine Hurley

gpls Classification using generalized partial leastsquares for two-group and multi-group (morethan 2 group) classification By Beiying Ding

hapassoc The following R functions are used forlikelihood inference of trait associations withhaplotypes and other covariates in generalizedlinear models The functions accommodate un-certain haplotype phase and can handle miss-ing genotypes at some SNPs By K Burkett BMcNeney J Graham

haplostats Haplo Stats is a suite of S-PLUSR rou-tines for the analysis of indirectly measuredhaplotypes The statistical methods assumethat all subjects are unrelated and that haplo-types are ambiguous (due to unknown link-age phase of the genetic markers) The ge-netic markers are assumed to be codominant(ie one-to-one correspondence between theirgenotypes and their phenotypes) and so we re-fer to the measurements of genetic markers asgenotypes The main functions in Haplo Statsare haploem haploglm and haploscoreBy Jason P Sinnwell and Daniel J Schaid

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP request protocols. Implements the GET, POST, and multipart POST requests. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, and Gilbert Chu.
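For readers new to the package, a minimal sketch of KNN imputation on an expression matrix might look like the following. The impute.knn() call, its k argument, and the $data component of the result follow the package's current API, so details may differ from the version announced here.

```r
# Sketch (assumes the Bioconductor 'impute' package is installed):
# fill missing microarray values from the k nearest genes.
library(impute)

set.seed(7)
expr <- matrix(rnorm(50), nrow = 10)    # 10 genes x 5 arrays
expr[sample(length(expr), 5)] <- NA     # knock out a few values at random
filled <- impute.knn(expr, k = 3)$data  # $data holds the completed matrix
sum(is.na(filled))                      # check that no missing values remain
```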

R News ISSN 1609-3631

Vol. 4/1, June 2004

kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, and Kurt Hornik.
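As a quick illustration of the kind of interface kernlab offers, here is a sketch of fitting a support vector machine with ksvm(); the formula interface and the "rbfdot" kernel name follow the current kernlab documentation and may differ in detail from the version announced here.

```r
# Sketch: two-class SVM on a subset of iris with an RBF kernel.
library(kernlab)

d <- subset(iris, Species != "setosa")
d$Species <- factor(d$Species)                 # drop the unused level
fit <- ksvm(Species ~ ., data = d, kernel = "rbfdot", C = 1)
table(predicted = predict(fit, d), actual = d$Species)
```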

klaR Miscellaneous functions for classification and visualization, developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, and Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., and Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley; some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, the so-called Poisson-Gamma models, towards a semiparametric specification. Just as in generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and a backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
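An X-bar chart can be produced in a couple of lines; this sketch uses simulated data, and the qcc() and process.capability() calls follow the package's current documentation, so argument details may differ from the version announced here.

```r
# Sketch: Shewhart X-bar chart and capability analysis with qcc.
library(qcc)

set.seed(1)
x <- matrix(rnorm(100, mean = 10, sd = 1), ncol = 5)  # 20 samples of size 5
q <- qcc(x, type = "xbar")                            # draws the control chart
process.capability(q, spec.limits = c(7, 13))         # Cp/Cpk against spec limits
```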

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref A small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log-likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward way to specify random effects models, including random interactions. By David Clifford and Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova and Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted, and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley and Achim Zeileis.
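To show what "model-robust" means in practice, the sketch below compares classical and heteroscedasticity-consistent standard errors for a linear model. vcovHC() is the package's HC estimator as documented in current versions; details may differ from the version announced here.

```r
# Sketch: classical vs. heteroscedasticity-consistent standard errors.
library(sandwich)

set.seed(42)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = abs(x))  # error variance depends on x
m <- lm(y ~ x)

sqrt(diag(vcov(m)))    # classical standard errors (assume constant variance)
sqrt(diag(vcovHC(m)))  # sandwich estimator, robust to heteroscedasticity
```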

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple algorithm for designing group sequential clinical trials", Biometrics, 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-PLUS times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function, and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the Matérn and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change-of-support problem, and for spatial data that arise on irregularly shaped regions like counties or zip codes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state and steady-state ARL (Average Run Length). Setting up control charts for a given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, p. 15869, and J. Comput. Chem., 2003, 24, p. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
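For instance, an augmented Dickey-Fuller test on a simulated random walk; the ur.df() call and its type/lags arguments follow the current urca documentation and may differ from the version announced here.

```r
# Sketch: ADF unit root test with urca on a random walk.
library(urca)

set.seed(123)
y <- cumsum(rnorm(200))                  # a random walk: contains a unit root
adf <- ur.df(y, type = "drift", lags = 2)
summary(adf)                             # test statistic vs. critical values
```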

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis and Gabor Grothendieck.
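The core idea, pairing observations with an arbitrary ordered index, can be sketched as follows; zoo(), window() and merge() are basic tools of the package's current API, so details may differ from the version announced here.

```r
# Sketch: an irregular daily series as a zoo object.
library(zoo)

z <- zoo(c(1.2, 2.0, 1.7, 2.4),
         as.Date(c("2004-01-05", "2004-01-07", "2004-01-12", "2004-01-13")))
window(z, start = as.Date("2004-01-07"))  # subset by date
merge(z, lag(z, k = -1))                  # align the series with its own lag
```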

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/

…given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, and Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes By Achim Zeileis fortune con-tributions from Torsten Hothorn Peter Dal-gaard Uwe Ligges Kevin Wright

gap This is an integrated package for genetic dataanalysis of both population and family dataCurrently it contains functions for samplesize calculations of both population-based andfamily-based designs probability of familialdisease aggregation kinship calculation somestatistics in linkage analysis and associationanalysis involving one or more genetic mark-ers including haplotype analysis In the futureit will incorporate programs for path and seg-regation analyses as well as other statistics inlinkage and association analyses By Jing huaZhao in collaboration with other colleaguesand with help from Kurt Hornik and Brian Rip-ley of the R core development team

gclus Orders panels in scatterplot matrices and par-allel coordinate displays by some merit indexPackage contains various indices of merit or-dering functions and enhanced versions ofpairs and parcoord which color panels accord-ing to their merit level By Catherine Hurley

gpls Classification using generalized partial leastsquares for two-group and multi-group (morethan 2 group) classification By Beiying Ding

hapassoc The following R functions are used forlikelihood inference of trait associations withhaplotypes and other covariates in generalizedlinear models The functions accommodate un-certain haplotype phase and can handle miss-ing genotypes at some SNPs By K Burkett BMcNeney J Graham

haplostats Haplo Stats is a suite of S-PLUSR rou-tines for the analysis of indirectly measuredhaplotypes The statistical methods assumethat all subjects are unrelated and that haplo-types are ambiguous (due to unknown link-age phase of the genetic markers) The ge-netic markers are assumed to be codominant(ie one-to-one correspondence between theirgenotypes and their phenotypes) and so we re-fer to the measurements of genetic markers asgenotypes The main functions in Haplo Statsare haploem haploglm and haploscoreBy Jason P Sinnwell and Daniel J Schaid

hett Functions for the fitting and summarizing ofheteroscedastic t-regression By Julian Taylor

httpRequest HTTP Request protocols Implementsthe GET POST and multipart POST request ByEryk Witold Wolski

impute Imputation for microarray data (currentlyKNN only) By Trevor Hastie Robert Tib-shirani Balasubramanian Narasimhan GilbertChu

R News ISSN 1609-3631

Vol 41 June 2004 44

kernlab Kernel-based machine learning methods in-cluding support vector machines By Alexan-dros Karatzoglou Alex Smola Achim ZeileisKurt Hornik

klaR Miscellaneous functions for classification andvisualization developed at the Department ofStatistics University of Dortmund By Chris-tian Roever Nils Raabe Karsten Luebke UweLigges

knncat This program scales categorical variables insuch a way as to make NN classification as ac-curate as possible It also handles continuousvariables and prior probabilities and does in-telligent variable selection and estimation of er-ror rates and the right number of NNrsquos By SamButtrey

kza Kolmogorov-Zurbenko Adpative filter for locat-ing change points in a time series By BrianClose with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models througha computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider2001) the procedure is of particular interest forhigh-dimensional data without missing valuessuch as geophysical fields By S M Barbosa

mathgraph Simple tools for constructing and ma-nipulating objects of class mathgraph fromthe book ldquoS Poetryrdquo available at httpwwwburns-statcompagesspoetryhtml ByOriginal S code by Patrick J Burns Ported toR by Nick Efthymiou

mcgibbsit Provides an implementation of Warnesamp Rafteryrsquos MCGibbsit run-length diagnos-tic for a set of (not-necessarily independent)MCMC sampers It combines the estimateerror-bounding approach of Raftery and Lewiswith evaluate between verses within chain ap-proach of Gelman and Rubin By Gregory RWarnes

mixreg Fits mixtures of one-variable regressions(which has been described as doing ANCOVAwhen you donrsquot know the levels) By RolfTurner

mscalib Calibration and filtering methods for cal-ibration of mass spectrometric peptide masslists Includes methods for internal externalcalibration of mass lists Provides methods forfiltering chemical noise and peptide contami-nants By Eryk Witold Wolski

multinomRob overdispersed multinomial regres-sion using robust (LQD and tanh) estimationBy Walter R Mebane Jr Jasjeet Singh Sekhon

mvbutils Utilities for project organization editingand backup sourcing documentation (formaland informal) package preparation macrofunctions and miscellaneous utilities Neededby debug package By Mark V Bravington

mvpart Multivariate regression trees Original rpartby Terry M Therneau and Beth Atkinson Rport by Brian Ripley Some routines from ve-gan by Jari Oksanen Extensions and adapta-tions of rpart to mvpart by Glenn Dersquoath

pgam This work is aimed at extending a class ofstate space models for Poisson count dataso called Poisson-Gamma models towards asemiparametric specification Just like the gen-eralized additive models (GAM) cubic splinesare used for covariate smoothing The semi-parametric models are fitted by an iterativeprocess that combines maximization of likeli-hood and backfitting algorithm By Washing-ton Junger

qcc Shewhart quality control charts for continu-ous attribute and count data Cusum andEWMA charts Operating characteristic curvesProcess capability analysis Pareto chart andcause-and-effect chart By Luca Scrucca

race Implementation of some racing methods for theempirical selection of the best If the R pack-age rpvm is installed (and if PVM is availableproperly configured and initialized) the eval-uation of the candidates are performed in par-allel on different hosts By Mauro Birattari

rbugs Functions to prepare files needed for run-ning BUGS in batch-mode and running BUGSfrom R Support for Linux systems with Wineis emphasized By Jun Yan (with part ofthe code modified from lsquobugsRrsquo httpwwwstatcolumbiaedu~gelmanbugsR by An-drew Gelman)

ref small package with functions for creating ref-erences reading from and writing ro refer-ences and a memory efficient refdata typethat transparently encapsulates matrixes anddataframes By Jens Oehlschlaumlgel

regress Functions to fit Gaussian linear model bymaximizing the residual log likelihood Thecovariance structure can be written as a lin-ear combination of known matrices Can beused for multivariate models and random ef-fects models Easy straight forward manner tospecify random effects models including ran-dom interactions By David Clifford Peter Mc-Cullagh

rgl 3D visualization device (OpenGL) By DanielAdler

R News ISSN 1609-3631

Vol 41 June 2004 45

rlecuyer Provides an interface to the C implemen-tation of the random number generator withmultiple independent streams developed byLrsquoEcuyer et al (2002) The main purpose ofthis package is to enable the use of this ran-dom number generator in parallel R applica-tions By Hana Sevcikova Tony Rossini

rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers

rrcov Functions for Robust Location and Scat-ter Estimation and Robust Regression withHigh Breakdown Point (covMcd() ltsReg())Originally written for S-PLUS (fastlts andfastmcd) by Peter Rousseeuw amp Katrien vanDriessen ported to R adapted and packagedby Valentin Todorov

sandwich Model-robust standard error estimatorsfor time series and longitudinal data ByThomas Lumley Achim Zeileis

seqmon A program that computes the probabilityof crossing sequential boundaries in a clinicaltrial It implements the Armitage-McPhersonand Rowe Algorithm using the method de-scribed in Schoenfeld D (2001) ldquo A simple Al-gorithm for Designing Group Sequential Clini-cal Trialsrdquo Biometrics 27 972ndash974 By David ASchoenfeld

setRNG Set reproducible random number generatorin R and S By Paul Gilbert

sfsmisc Useful utilities [lsquogoodiesrsquo] from Seminarfuer Statistik ETH Zurich many ported fromS-plus times By Martin Maechler and manyothers

skewt Density distribution function quantile func-tion and random generation for the skewed tdistribution of Fernandez and Steel By RobertKingbdquo with contributions from Emily Ander-son

spatialCovariance Functions that compute the spa-tial covariance matrix for the matern andpower classes of spatial models for data thatarise on rectangular units This code can alsobe used for the change of support problem andfor spatial data that arise on irregularly shapedregions like counties or zipcodes by laying afine grid of rectangles and aggregating the in-tegrals in a form of Riemann integration ByDavid Clifford

spc Evaluation of control charts by means of thezero-state steady-state ARL (Average RunLength) Setting up control charts for given in-control ARL and plotting of the related figuresThe control charts under consideration are one-and two-sided EWMA and CUSUM charts formonitoring the mean of normally distributedindependent data Other charts and parame-ters are in preparation Further SPC areas willbe covered as well (sampling plans capabilityindices ) By Sven Knoth

spe Implements stochastic proximity embedding asdescribed by Agrafiotis et al in PNAS 200299 pg 15869 and J Comput Chem 200324 pg1215 By Rajarshi Guha

supclust Methodology for Supervised Grouping ofPredictor Variables By Marcel Dettling andMartin Maechler

svmpath Computes the entire regularization pathfor the two-class SVM classifier with essentialythe same cost as a single SVM fit By TrevorHastie

treeglia Stem analysis functions for volume incre-ment and carbon uptake assessment from tree-rings By Marco Bascietto

urca Unit root and cointegration tests encounteredin applied econometric analysis are imple-mented By Bernhard Pfaff

urn Functions for sampling without replacement(Simulated Urns) By Micah Altman

verify This small package contains simple tools forconstructing and manipulating objects of classverify These are used when regression-testingR software By S Original by Patrick J BurnsPorted to R by Nick Efthymiou

zoo A class with methods for totally ordered in-dexed observations such as irregular time se-ries By Achim Zeileis Gabor Grothendieck

Other changes

bull Packages mclust1998 netCDF and serializewere moved from the main CRAN section tothe Archive

Kurt HornikWirtschaftsuniversitaumlt Wien AustriaKurtHornikR-projectorg

R News ISSN 1609-3631

Vol 41 June 2004 46

R Foundation Newsby Bettina Gruumln

Donations

bull Dipartimento di Scienze Statistiche e Matem-atiche di Palermo (Italy)

bull Emanuele De Rinaldis (Italy)

bull Norm Phillips (Canada)

New supporting institutions

bull Breast Center at Baylor College of MedicineHouston Texas USA

bull Dana-Farber Cancer Institute Boston USA

bull Department of Biostatistics Vanderbilt Univer-sity School of Medicine USA

bull Department of Statistics amp Actuarial ScienceUniversity of Iowa USA

bull Division of Biostatistics University of Califor-nia Berkeley USA

bull Norwegian Institute of Marine ResearchBergen Norway

bull TERRA Lab University of Regina - Depart-ment of Geography Canada

bull ViaLactia Biosciences (NZ) Ltd AucklandNew Zealand

New supporting members

Andrej Blejec (Slovenia)Javier de las Rivas (Spain)Sandrine Dudoit (USA)Werner Engl (Austria)Balazs Gyoumlrffy (Germany)Branimir K Hackenberger (Croatia)O Maurice Haynes (USA)Kai Hendry (UK)Arne Henningsen (Germany)Wolfgang Huber (Germany)Marjatta Kari (Finland)Osmo Kari (Finland)Peter Lauf (Germany)Andy Liaw (USA)Mu-Yen Lin (Taiwan)Jeffrey Lins (Danmark)James W MacDonald (USA)Marlene Muumlller (Germany)Gillian Raab (UK)Roland Rau (Germany)Michael G Schimek (Austria)R Woodrow Setzer (USA)Thomas Telkamp (Netherlands)Klaus Thul (Taiwan)Shusaku Tsumoto (Japan)Kevin Wright (USA)

Bettina GruumlnTechnische Universitaumlt Wien AustriaBettinaGruencituwienacat

Editor-in-ChiefThomas LumleyDepartment of BiostatisticsUniversity of WashingtonSeattle WA 98195-7232USA

Editorial BoardDouglas Bates and Paul Murrell

Editor Programmerrsquos NicheBill Venables

Editor Help DeskUwe Ligges

Email of editors and editorial boardfirstnamelastname R-projectorg

R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)

R Project HomepagehttpwwwR-projectorg

This newsletter is available online athttpCRANR-projectorgdocRnews

R News ISSN 1609-3631

  • Editorial
  • The Decision To Use R
    • Preface
    • Some Background on the Company
    • Evaluating The Tools of the Trade - A Value Based Equation
    • An Introduction To R
    • A Continued Evolution
    • Giving Back
    • Thanks
      • The ade4 package - I One-table methods
        • Introduction
        • The duality diagram class
        • Distance matrices
        • Taking into account groups of individuals
        • Permutation tests
        • Conclusion
          • qcc An R package for quality control charting and statistical process control
            • Introduction
            • Creating a qcc object
            • Shewhart control charts
            • Operating characteristic function
            • Cusum charts
            • EWMA charts
            • Process capability analysis
            • Pareto charts and cause-and-effect diagrams
            • Tuning and extending the package
            • Summary
            • References
              • Least Squares Calculations in R
              • Tools for interactively exploring R packages
                • References
                  • The survival package
                    • Overview
                    • Specifying survival data
                    • One and two sample summaries
                    • Proportional hazards models
                    • Additional features
                      • useR 2004
                      • R Help Desk
                        • Introduction
                        • Other Applications
                        • Time Series
                        • Input
                        • Avoiding Errors
                        • Comparison Table
                          • Programmers Niche
                            • ROC curves
                            • Classes and methods
                              • Generalising the class
                                • S4 classes and method
                                • Discussion
                                  • Changes in R
                                    • New features in 191
                                    • User-visible changes in 190
                                    • New features in 190
                                      • Changes on CRAN
                                        • New contributed packages
                                        • Other changes
                                          • R Foundation News
                                            • Donations
                                            • New supporting institutions
                                            • New supporting members
Page 44: R News 4(1)

Vol 41 June 2004 44

kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and the backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
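As a quick illustration of the qcc entry above, here is a minimal sketch (assuming the qcc package is installed; the data are simulated, not from the package):

```r
# Hypothetical example: a Shewhart xbar chart for 20 subgroups of 5
# simulated measurements; qcc() estimates the center and control limits.
library(qcc)

set.seed(1)
x <- matrix(rnorm(100, mean = 10), nrow = 20)  # rows = rational subgroups
q <- qcc(x, type = "xbar", plot = FALSE)       # build the chart object
summary(q)                                     # center, limits, violations
```

Calling plot(q) then draws the chart itself.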

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugsR', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler

R News ISSN 1609-3631


rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
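As a quick illustration of the zoo entry above, a minimal sketch (assuming the zoo package is installed; the dates and values are invented):

```r
# Hypothetical example: an irregular daily series indexed by Date.
# Methods such as lag() and merge() respect the totally ordered index.
library(zoo)

z <- zoo(c(1.2, 0.8, 1.5),
         order.by = as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
lag(z, k = -1)   # shift observations along the irregular index
merge(z, z^2)    # align two series on their common index
```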

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
[email protected]


R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
[email protected]

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname @R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
